Ever since the first genome-wide association study (GWAS) on age-related macular degeneration, and the promise of personalized medicine in the wake of the Human Genome Project, large-scale genetic association studies have held significant sway in contemporary health research and driven drug-development pipelines. In the past two decades, researchers have delved into GWAS, aiming to unveil genetic variants linked to human traits, such as the color of your eyes, and to rare and common complex diseases. These findings serve as crucial keys to unraveling the intricate mechanisms underlying disease, shedding light on whether the correlations identified in observational studies between risk factors and diseases are truly causal.
These studies have ushered in an exciting era in which many researchers thrive on developing new methods and bioinformatic tools to parse the ever-growing large datasets collected in large population-based biobanks. However, the analyses of these data are challenging, and it can be daunting to see the forest for the trees among the many tools and their various functions. Enter A Practical Primer in Human Complex Genetics. This GitBook was originally written back in 2022 for the Genetic Epidemiology course organized by the Master Epidemiology of Utrecht University. This practical primer will teach you how to design a GWAS, perform quality control (QC), execute the actual analyses, annotate the GWAS results, and perform further downstream post-GWAS analyses. Throughout the book you’ll work with ‘dummy’, that is fake, data, but in the end we will use real-world data from the first release of the Wellcome Trust Case-Control Consortium (WTCCC), focusing on coronary artery disease (CAD).
A major component of modern-day GWAS is genetic imputation, but for practical reasons it is not part of this book. However, I will provide some pointers on how to go about doing this with minimal coding or scripting experience. Likewise, the course does not cover meta-analyses of GWAS, but some excellent resources exist to which I will direct you. As this practical primer evolves, these and other topics may find their place in this book. I should also point out that the emphasis of this book is on being a practical primer. It is intended to provide practical guidance for doing GWAS, and while theory is important, I will not cover it. Again, some very useful and excellent work exists to which I will point you, but I really want you to learn - and understand the theory - by doing.
So, although originally crafted as a companion for the course, this practical primer stands on its own as a comprehensive resource for diving into all facets of doing a GWAS — save for experimental follow-up, of course 😉.
I can imagine this seems overwhelming, but trust me, you’ll be okay. Just follow this practical. You’ll learn by doing and at the end of the day, you can execute a GWAS independently.
Ready to start?
Your first point of action is to prepare your system for this course in Chapter 2.
Standing on the shoulders of giants, that’s what this book and I do. I want to acknowledge some great work that has helped me tremendously; really, this book wouldn’t exist without this awesome work. So, I do want to give you some background reading. Is it a prerequisite? No, not really. For starters, the course covers most of it and you’ll learn as you go. And if you didn’t come here through the course, you’ll be fine just the same. That said, it’s always a good idea to get familiar with these works as you move forward on your path towards your first GWAS - in fact, I had these printed out with markings and writings all over them as I executed my first GWAS, and they’ve been a great reference many times since.
Large parts of this work are based on four awesome Nature Protocols from the Zondervan group at the Wellcome Centre for Human Genetics.
An update on the community standards of QC for GWAS can be found here:
With respect to imputation and meta-analyses of GWAS you should also get familiar with the following two works:
Are you ready?
Are you ready? Did you bring coffee and a good dose of energy? Let’s start in Chapter 3.
Before getting started, we need to discuss computers, specifically your computer. Most programs made to execute genetic epidemiology studies are developed for the Unix environment, for example Linux and macOS. Years ago - now I sound really old - when I just started out on my journey in genetic epidemiology, I had a fancy (no, really) Windows laptop. So, stubborn as I was (am), I tried to make this work (I remember Jessica - see Epilogue 18 - warning against it) - I lost 3 months of my life. Then a friend of mine lent me his old MacBook - you know, really old, the beautiful white MacBook - and I was stunned. Everything worked. Instantly. Even on that old little white gem running OS X Leopard. I was sold. Until then I never understood why all those ‘gen epi’ people were always working on a Mac - I thought it was just a fashion statement. But it is not. It is a necessity in our line of work.
So, take my advice, don’t waste your time. You think a MacBook is expensive? Time is the most precious commodity we have - when it’s gone, it’s gone - you’ll never get it back. Money? You can make more. Invest in a MacBook, or get your employer to do so. Get a refurbished one - there are many great options out there, and you really don’t need the latest model. Everything in this practical primer can be done on a MacBook Air (from the 2010 generation and up) with 8 GB of RAM and at least a 256 GB hard drive.
Besides, those ‘gen epi’ programs may not work as intended in a Windows environment. Sure, Windows 10+ does allow users to install a Linux subsystem, and you can find the detailed guide here. And you can create a virtual machine with Ubuntu installed on your Windows computer. But honestly, I wouldn’t bother.
This is not some elevator pitch to buy a MacBook - in all other cases you’ll probably be fine without one. But trust me: ‘gen epi’, and all the other (academic) ‘office’ stuff you need to do, works like a charm on macOS.
For this practical every command is intended for Linux/macOS, in other words Unix-systems.
Next, I’ll briefly provide some background on the various programs that are commonly used, in Chapter 4.
We’ll use a few programs throughout this practical. You’ll probably need these for your (future) genetic epidemiology work too (Table 4.1).
Program | Link | Description |
|---|---|---|
PLINK | https://www.cog-genomics.org/plink2/ | PLINK is a free, open-source genetic analysis tool set, designed to perform a range of basic data parsing and quality control, as well as basic and large-scale analyses in a computationally efficient manner. |
R | https://cran.r-project.org/ | A program to perform statistical analysis and visualizations. |
RStudio | https://www.rstudio.com | A user-friendly R-wrap-around for code editing, debugging, analyses, and visualization. |
Homebrew | https://brew.sh | A great extension for Mac-users to install really useful programs that Apple didn't. |
RStudio is a very user-friendly interface around R that makes your R-scripting life a lot easier; you should get used to it. Note that RStudio does not bundle R itself, so install R first via the CRAN link above.
Right, onto PLINK.
All genetic analyses can be done in PLINK, even on your laptop, but with large datasets, for example of UK Biobank size, it is better to switch to a high-performance computing (HPC) cluster like the one we have available at the Utrecht Science Park. The original PLINK v1.07 can be found here, but nowadays we use a newer, faster version: PLINK v1.9, which can be found here. It still says ‘PLINK 1.90 beta’ (Figure 4.1), but you can consider this version stable and safe to work with; note, though, that some functions are no longer supported.
Figure 4.1: The PLINK v1.9 website.
Nowadays, a lot of people also use programs like SNPTEST, BOLT-LMM, GCTA, or regenie as alternatives to execute GWAS. These programs were designed with specific use cases in mind, for instance really large biobank data including hundreds of thousands of individuals, better control for population stratification, the ability to estimate trait heritability or Fst, and so on.
Mendelian randomization can be done either with the SMR or GSMR function from GCTA, or with R-packages, like TwoSampleMR.
As you are following the Genetic Epidemiology course, the next thing we’ll cover are the CoCalc-instructions in Chapter 5.
Now, pay attention. You came here through the course Genetic Epidemiology, which means you don’t have to do anything: all the data you need are already downloaded and everything was set up for you on the server. Just follow the link provided by the course instructors.
Here we provide a few short instructions to navigate CoCalc.
Once logged in you should see a screen similar to the one below.
Figure 5.1: CoCalc after logging in.
Navigate to the Course by clicking on the blue Your UBC CoCalc 1 projects…-button and selecting the course Genetic Epidemiology you are following.
Figure 5.2: CoCalc projects.
Once you are in the course, you will see a screen like this one below.
Figure 5.3: CoCalc student page.
In the Course Book you’ll find the handouts you need. In the share data folder you will find the data you’ll need.
CoCalc is an online platform that allows you to work with Jupyter Notebook, LaTeX, and other tools. It is a great tool for working with data and code, and it is especially useful for collaborative work.
Jupyter Notebook is an interactive environment for writing and running code, and at the same time, it allows you to add text, images, and other content to your document. This makes it a great tool for creating reports, tutorials, and presentations. Jupyter Notebook has support for over 40 programming languages, including those popular in Bioinformatics such as Python and R.
A Jupyter Notebook is made up of cells. Some contain instructions, others contain lines of code, while others combine both (like this one). You can run each cell by selecting it and pressing Shift+Enter. The code in the cell will be executed, and the output will be displayed below the cell. A cell can be a ‘code’ cell, a markdown cell, or a raw ‘text’ cell.
A ‘code’ cell contains code in R or Python; a markdown cell is for markdown text (you can learn more about markdown here). Each notebook is associated with a single kernel: a program that runs and interprets the code you write in a notebook. It is responsible for executing the code and returning the results. Each kernel has its own environment and programming language, as well as its own set of libraries and packages. Luckily, for the purpose of this practical everything you need is installed.
Working with a Jupyter Notebook is easy. Here are a few tips to get you started:

- select a cell and press Shift+Enter to run it;
- use the + button in the toolbar to add a new cell;
- use the cell-type dropdown in the toolbar to switch a cell between ‘code’ and ‘markdown’.
Sometimes things go awry for no apparent reason. Here are a few tips to get you back on track:

- interrupt the kernel if a cell hangs;
- restart the kernel if the notebook freezes;
- if all else fails, refresh your browser - CoCalc saves your work automatically.
The great thing is, that after a freeze and restart, you can continue from where you were. But always first go to the right directory again since the kernel doesn’t remember your previously executed commands.
In the course we use CoCalc as we can precisely control what is installed and what is not. And you don’t have to worry about installing software or libraries. Another advantage is that you can work on the practical from anywhere, as long as you have an internet connection. And lastly, all the data you need is already there, so you don’t have to worry about downloading it and we save a lot of space since you will all be working from the same source data.
This practical uses both the R and bash programming languages. We use plink, which is a Linux program run in the Terminal (see ??) and uses bash, so plink needs a bash kernel. The data from the plink analyses are parsed and plotted using R code, so we also need an R kernel.
CoCalc has both kernels installed, so you could run both bash and R code in the same notebook. But the commands to make this work are a bit complicated, so we decided to split the bash and R code: a virtual Terminal for the bash code and a Jupyter Notebook with an R kernel for the R code.
This means you have to create a new notebook for the R code, and a new Terminal for the bash code. In the next section we explain how you can do this.
For the course we need to create a Jupyter Notebook with an R kernel to create all the plots. To start your own notebook, click on the New button and select Jupyter Notebook.
Figure: CoCalc new notebook.
You will probably get a screen asking you to select a kernel. You should choose R.
Figure 5.4: CoCalc kernel selection.
Now you are ready to start your own R Notebook - like the one below. You can create a new ‘cell’ with format ‘code’, ‘markdown’, or raw ‘text’ and start typing.
Figure 5.5: CoCalc starting your notebook.
You could get your working directory by typing getwd() in a cell and pressing Shift+Enter.
Figure 5.6: CoCalc get working directory.
Or you could list the contents of the working directory by typing list.files() or dir() in a cell and pressing Shift+Enter.
Figure 5.7: CoCalc list files.
You could also create a dummy plot with R in this notebook.
Figure 5.8: CoCalc plot.
For the course we need to create a Terminal to run bash code. To start your own Terminal, click on the New button and select Terminal.
Figure 5.9: CoCalc new terminal.
Type in ‘My Terminal’ and hit the Create button.
Figure 5.10: CoCalc new terminal creation.
A new tab will open with a Terminal.
Figure 5.11: CoCalc terminal window.
Now you can start typing your bash commands. Let’s see what we have in the directory you’re at by typing ls. Check Chapter ?? for more background on some common Terminal commands.
Figure 5.12: CoCalc listing the contents of the directory.
Let’s review what we see here. First of all there are two colors, blue and black. The blue ones are directories, the black ones are files. The directories are:
- Course Book, which contains the handouts;
- data, which contains the data you need for the course.

The files are:

- My Notebook.ipynb, the Jupyter Notebook with an R kernel you just created;
- My Terminal, the terminal you just created, which can handle the bash language and where you can run plink.

We can also list the contents of the data directory by typing ls ~/data/shared/ in the Terminal. The ~ means ‘home’, which could also be written as $HOME.
Figure 5.13: CoCalc terminal window.
Let’s check out the contents of the readme.txt file.
cat ~/data/shared/readme.txt
Figure 5.14: CoCalc readme file.
As I wrote, we installed plink for you, but just to be sure, let’s check if it is installed. Type plink in the Terminal and press ENTER.
Figure 5.15: CoCalc plink.
This book is designed to be used in conjunction with bash and R and it was tested on macOS and Linux. Since you’re working in CoCalc you always start in ‘home’, that is ~ - you can navigate there with cd ~. All the data are relative to ‘home’, so they are here: ~/data/shared.
This means that all the commands in this book are relative to ‘home’ too. Throughout the book you will find plink commands that you can run in the Terminal. You can copy these commands and paste them in the Terminal you just created. And the R commands work in the Jupyter Notebook you just created.
Are you ready? Did you bring coffee and a good dose of energy? Let’s start!
Oh, one more thing: you don’t have to worry about saving your notebook or the Terminal you just created; CoCalc will automatically save these.
Ok. ’Nough said, let’s move on to cover some basics in Chapter 6.
Now that you understand a bit of the navigation in Unix systems, you’re ready to start this practical primer. We will make use of a dummy dataset containing cases and controls, explaining and executing each step along the way: converting the raw data to a binary format, per-sample quality control, per-SNP quality control, the association analysis itself, and visualization of the results.
The format in which genotype data are returned to investigators varies among genome-wide SNP platforms and genotyping centers. Usually genotypes have been called by a genotyping center and returned in the standard PED and MAP file formats designed for PLINK.
A PED file is a white-space (space or tab) delimited file in which each line represents one individual; the first six columns are mandatory and in the following order:

- Family ID (FID);
- Individual ID (IID);
- Paternal ID (0 if unknown);
- Maternal ID (0 if unknown);
- Sex (1 = male, 2 = female, other = unknown);
- Phenotype (1 = control, 2 = case; 0 or -9 = missing).
The subsequent columns denote genotypes that can be any character (e.g., 1, 2, 3, 4 or A, C, G, T). Zero denotes a missing genotype. Each SNP must have two alleles (i.e., both alleles are either present or absent). The order of SNPs in the PED file is given in the MAP file, in which each line denotes a single marker and the four white-space–separated columns are chromosome (1–22, X, Y or 0 for unplaced), marker name (typically an rs number), genetic distance in Morgans (this can be fixed to 0) and base-pair position (bp units).
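To make this concrete, below is a fabricated two-individual, two-SNP example of what these files might look like. The family and individual IDs and the genotypes are made up for illustration; the two markers are borrowed from later in this chapter.

# raw-GWA-data.ped: one individual per line
FAM001 IND001 0 0 1 2 T C A G
FAM002 IND002 0 0 2 1 T T G G

# raw-GWA-data.map: one marker per line
1 rs3934834 0 995669
1 rs3737728 0 1011278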
Let’s start by using PLINK to convert the datasets to a lighter, binary form (a .bed file). This file saves data in a more memory- and time-efficient manner (a ‘binary’ format) to facilitate the analysis of large-scale data sets (Purcell S. 2007). The marker information is stored in the .bim file and the family information in the .fam file. PLINK creates a .log file (here dummy_project/rawdata.log, named after the --out argument) that details (among other information) the implemented commands, the number of cases and controls in the input files, any excluded data and the genotyping rate in the remaining data. This file is very useful for checking whether the software is successfully completing commands.
Make sure you are in the right directory. Do you remember how to get there?
In the Terminal window you type the following:
cd $HOME
You know that you’re ‘home’ when you type pwd and see a path with a lot of numbers and letters like this: /projects/e66af6fa-60ee-4448-972f-daaf235ca9da - yours will be a different sequence of numbers and letters.
Start by creating a project directory.
mkdir -v dummy_project
As I mentioned, all the data is stored in the ~/data/shared directory. So, if you do the following:
ls ~/data/shared
You will see this:
dummy_project gwas rawdata readme.txt readme.wtccc1.txt ref_1kg_phase1_all scripts support wtccc1
(in blue the directories, in black the files)
- dummy_project, in ~/data/shared, contains all the results from this practical primer. You’re not going to use it, if all is well, because you’re going to do it yourself in the following chapters;
- the gwas directory contains the data from the dummy GWAS we will perform;
- the rawdata directory contains the dummy data we will use;
- the ref_1kg_phase1_all directory contains the reference data from the 1000 Genomes Project;
- the wtccc1 directory contains the WTCCC1 data;
- the support directory contains some additional files that are used in the practical primer;
- the scripts directory contains the scripts we will use;
- the readme.txt and readme.wtccc1.txt files contain some additional information.

Let’s convert the .ped/.map files to the binary format.
plink --file ~/data/shared/rawdata/raw-GWA-data --make-bed --out dummy_project/rawdata
Let’s review the .log-file for a bit. It should look something like this:
PLINK v1.90b7.2 64-bit (11 Dec 2023) www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to dummy_project/rawdata.log.
Options in effect:
--file rawdata/raw-GWA-data
--make-bed
--out dummy_project/rawdata
16384 MB RAM detected; reserving 8192 MB for main workspace.
.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (317503 variants, 2000 people).
--file: dummy_project/rawdata-temporary.bed +
dummy_project/rawdata-temporary.bim + dummy_project/rawdata-temporary.fam
written.
317503 variants loaded from .bim file.
2000 people (997 males, 1003 females) loaded from .fam.
2000 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 2000 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: 11440 het. haploid genotypes present (see dummy_project/rawdata.hh );
many commands treat these as missing.
Total genotyping rate is 0.985682.
317503 variants and 2000 people pass filters and QC.
Among remaining phenotypes, 1023 are cases and 977 are controls.
--make-bed to dummy_project/rawdata.bed + dummy_project/rawdata.bim +
dummy_project/rawdata.fam ... done.
So, there are 317,503 variants included for 2,000 people, 997 males and 1,003 females. All of these individuals are ‘founders’. There are 1,023 cases and 977 controls. The genotyping rate is about 98.6% which is pretty good. Lastly, there are 11,440 heterozygous haploid genotypes present.
Question: Can you think of what the ‘11,440 heterozygous haploid genotypes present’ represent?
We are ready for some quality control and quality assurance, heavily inspired by Anderson et al. (Anderson C. A. 2010) and Laurie et al. (Laurie C. C. 2010). In general, we should check the data quality on two levels: per sample and per SNP (variant).
So, we will investigate the following:

- samples with discordant sex information;
- samples with excessive missingness (a low call rate);
- samples with outlying heterozygosity rates;
- duplicated or related samples;
- samples of divergent ancestry;
- SNPs with a low call rate;
- SNPs with differential missingness between cases and controls;
- SNPs with a low minor allele frequency;
- SNPs strongly deviating from Hardy-Weinberg equilibrium.
Question: Can you think of other scenarios in which you may want to extend the check on differential missingness beyond a check between cases and controls?
All clear? Let’s start the work. On to step 1 of the QC for GWAS: filter samples of poor quality in Chapter 7.
Let’s start with the per-sample quality control. The goal here is to identify all the samples that for some reason (mentioned in the previous section) should be excluded.
We need to identify individuals with discordant sex information by comparing the phenotypic and genotypic data. Let’s calculate the mean homozygosity rate across X-chromosome markers for each individual in the study.
plink --bfile dummy_project/rawdata --check-sex --out dummy_project/rawdata
This produces a file (rawdata.sexcheck) with the following columns:

- FID, the family ID;
- IID, the individual ID;
- PEDSEX, the sex as recorded in the pedigree/phenotype data;
- SNPSEX, the sex as inferred from the genotype data;
- STATUS, OK or PROBLEM;
- F, the X-chromosome inbreeding (homozygosity) estimate.
We need to get a list of individuals with discordant sex data.
cat dummy_project/rawdata.sexcheck | awk '$5 =="STATUS" || $5 =="PROBLEM"' > dummy_project/rawdata.sexprobs.txt
Let’s have a look at the results in the Terminal, and then load them in R.
cat dummy_project/rawdata.sexprobs.txt
sexissues <- data.table::fread("dummy_project/rawdata.sexprobs.txt")
FID | IID | PEDSEX | SNPSEX | STATUS | F |
|---|---|---|---|---|---|
772 | 772 | 2 | 0 | PROBLEM | 0.3084 |
853 | 853 | 2 | 0 | PROBLEM | 0.3666 |
1920 | 1920 | 2 | 0 | PROBLEM | 0.4066 |
When the homozygosity rate (F) is more than 0.2 but less than 0.8, the genotype data are inconclusive regarding the sex of an individual; such samples are marked with a 0 in the SNPSEX column and with “PROBLEM” in the STATUS column.
Report the IDs of individuals with discordant sex information (Table 7.1) to those who conducted sex phenotyping. In situations in which discrepancy cannot be resolved, add the family ID (FID) and individual ID (IID) of the samples to a file named fail-sexcheck-qc.txt (one individual per line, tab delimited).
grep "PROBLEM" dummy_project/rawdata.sexcheck | awk '{ print $1, $2}' > dummy_project/fail-sexcheck-qc.txt
Let’s get an overview of the missing data per sample and per SNP.
plink --bfile dummy_project/rawdata --missing --out dummy_project/rawdata
This produces two files, dummy_project/rawdata.imiss and dummy_project/rawdata.lmiss. In the .imiss file the N_MISS column denotes the number of missing SNPs, and the F_MISS column denotes the proportion of missing SNPs per individual.
raw_IMISS <- data.table::fread("dummy_project/rawdata.imiss")
raw_IMISS$callrate <- 1 - raw_IMISS$F_MISS
Now you can plot the missing genotypes per sample, i.e. the ‘sample call rate’.
library("ggpubr")
ggpubr::gghistogram(raw_IMISS, x = "callrate",
add = "mean", add.params = list(color = "#595A5C", linetype = "dashed", size = 1),
rug = TRUE, bins = 50,
color = "#1290D9", fill = "#1290D9",
xlab = "per sample call rate") +
ggplot2::geom_vline(xintercept = 0.95, linetype = "dashed",
color = "#E55738", size = 1)
ggplot2::ggsave("dummy_project/gwas-qc-sample-call-rate.png",
plot = last_plot())
The grey dashed line in Figure 7.1 indicates the mean call rate, while the red dashed line indicates the threshold we had determined above.
knitr::include_graphics("img/gwas_dummy/gwas-qc-sample-callrate.png")
Figure 7.1: Per sample call rate.
To properly calculate heterozygosity rate and relatedness (identity-by-descent [IBD]) we need to do four things:

1. restrict to high-quality, common autosomal variants (filtering on missingness, minor allele frequency, and Hardy-Weinberg equilibrium);
2. prune the data so that only approximately independent SNPs remain;
3. exclude known high-LD regions of the genome;
4. remove the strand-ambiguous A/T and C/G SNPs.
You can find an up-to-date list of LD blocks you should exclude in these types of analyses here for the different genome builds. In this case we are using build 37. For the purpose of this book we included a file with these regions in the support-directory.
We will use the following settings:
- --exclude dummy_project/all.atcg.variants.txt;
- --geno 0.10;
- --hwe 1e-3;
- --maf 0.10;
- pruning at r^2 = 0.2 with the flags --indep-pairwise 100 10 0.2 and --extract rawdata/raw-GWA-data.prune.in;
- --exclude range support/exclude_problematic_range.txt;
- --autosome.

First, get a list of A/T and C/G SNPs. Remember, the list of markers for this GWAS is noted in the .bim file. We can simply grep all the lines where the two alleles form either an A/T or a C/G combination.
cat dummy_project/rawdata.bim | \
awk '($5 == "A" && $6 == "T") || ($5 == "T" && $6 == "A") || ($5 == "C" && $6 == "G") || ($5 == "G" && $6 == "C")' | awk '{ print $2, $1, $4, $3, $5, $6 }' \
> dummy_project/all.atcg.variants.txt
Second, clean the data and get a list of independent SNPs.
plink --bfile dummy_project/rawdata \
--autosome \
--maf 0.10 --geno 0.10 --hwe 1e-3 \
--indep-pairwise 100 10 0.2 \
--exclude range ~/data/shared/support/exclude_problematic_range.txt \
--make-bed --out dummy_project/rawdata.clean.temp
Please note, we have created a dataset without taking LD structure into account, hence the message ‘Pruned 0 variants from chromosome 1, leaving 19420.’ et cetera. In a dataset without any LD structure the flag
--indep-pairwise 100 10 0.2 doesn’t actually do anything. However, with real data you can use it to prune out unwanted SNPs in high LD.
Third, extract the independent (pruned-in) SNPs. Note how we use the pre-computed raw-GWA-data.prune.in list provided with the course data for the purpose of this practical.
plink --bfile dummy_project/rawdata.clean.temp \
--extract ~/data/shared/rawdata/raw-GWA-data.prune.in \
--make-bed --out dummy_project/rawdata.clean.ultraclean.temp
Fourth, remove the A/T and C/G SNPs.
plink --bfile dummy_project/rawdata.clean.ultraclean.temp \
--exclude dummy_project/all.atcg.variants.txt \
--make-bed --out dummy_project/rawdata.clean.ultraclean
Please note, this dataset doesn’t actually include this type of SNP, hence
dummy_project/all.atcg.variants.txt is empty! Again, you can use this command on real data to exclude A/T and C/G SNPs.
Lastly, remove the temporary files.
rm -fv dummy_project/*.temp*
Finally, we can calculate the heterozygosity rate.
plink --bfile dummy_project/rawdata.clean.ultraclean --het --out dummy_project/rawdata.clean.ultraclean
This creates the file dummy_project/rawdata.clean.ultraclean.het, in which the third column denotes the observed number of homozygous genotypes, O(Hom), and the fifth column denotes the number of nonmissing genotypes, N(NM), per individual. We can now calculate the observed heterozygosity rate per individual using the formula (N(NM) - O(Hom))/N(NM).
Often there is a correlation between the heterozygosity rate and missing data. Thus, we should plot the proportion of missing genotypes per individual on the x-axis and the observed heterozygosity rate per individual on the y-axis (Figure 7.2).
raw_HET <- data.table::fread("dummy_project/rawdata.clean.ultraclean.het")
raw_IMISS$logF_MISS = log10(raw_IMISS$F_MISS)
prop_miss = -1.522879 # log10(0.03): a genotype failure (missingness) threshold of 3%
raw_HET$meanHet = (raw_HET$`N(NM)` - raw_HET$`O(HOM)`)/raw_HET$`N(NM)`
lower_meanHet = mean(raw_HET$meanHet) - (3*sd(raw_HET$meanHet))
upper_meanHet = mean(raw_HET$meanHet) + (3*sd(raw_HET$meanHet))
raw_IMISSHET = merge(raw_IMISS, raw_HET, by = "IID")
raw_IMISSHET$FID.y <- NULL
colnames(raw_IMISSHET)[colnames(raw_IMISSHET)=="FID.x"] <- "FID"
colors <- densCols(raw_IMISSHET$logF_MISS, raw_IMISSHET$meanHet)
library("geneplotter")
# you can also use the following code to create the plot
# png("dummy_project/gwas-qc-imiss-vs-het.png")
plot(raw_IMISSHET$logF_MISS,
raw_IMISSHET$meanHet,
col = colors,
xlim = c(-3,0), ylim = c(0,0.5),
pch = 20,
xlab = "Proportion of missing genotypes", ylab = "Heterozygosity rate", axes = FALSE)
axis(2, at = c(0,0.05,0.10,0.15,0.2,0.25,0.3,0.35,0.4,0.45,0.5), tick = TRUE)
axis(1, at = c(-3,-2,-1,0),
labels = c(0.001,0.01,0.1,1))
abline(h = lower_meanHet, col = "#E55738", lty = 2)
abline(h = upper_meanHet, col = "#E55738", lty = 2)
abline(v = prop_miss, col = "#E55738", lty = 2)
knitr::include_graphics("img/gwas_dummy/show-heterozygosity.png")
Figure 7.2: Heterozygosity as a function of SNP call rate.
Examine the plot (Figure 7.2) to decide on reasonable thresholds at which to exclude individuals based on elevated missingness or extreme heterozygosity. We chose to exclude all individuals with a genotype failure rate >= 0.03 (vertical dashed line) and/or a heterozygosity rate more than 3 s.d. from the mean (horizontal dashed lines). Add the FID and IID of the samples failing this QC to the file named fail-imisshet-qc.txt.
How would you create this file?
raw_IMISSHETsub = subset(raw_IMISSHET, logF_MISS > prop_miss | (meanHet < lower_meanHet | meanHet > upper_meanHet),
select = c("FID", "IID"))
data.table::fwrite(raw_IMISSHETsub, "dummy_project/fail-raw_IMISSHETsub.txt", sep =" ")
View(raw_IMISSHETsub)
If all is right, you’d have something like Table 7.2.
library("flextable")
ft <- flextable(
raw_IMISSHETsub,
# col_keys = names(data),
# cwidth = 0.75,
# cheight = 0.25,
# defaults = list(),
theme_fun = theme_booktabs
)
#
# set_table_properties(ft, width = 1, layout = "autofit")
#
# fit_to_width(ft, max_width = 50)
#
autofit(ft, add_w = 0, add_h = 0)
FID | IID |
|---|---|
28 | 28 |
50 | 50 |
52 | 52 |
53 | 53 |
67 | 67 |
96 | 96 |
205 | 205 |
298 | 298 |
310 | 310 |
351 | 351 |
404 | 404 |
418 | 418 |
423 | 423 |
428 | 428 |
442 | 442 |
444 | 444 |
548 | 548 |
772 | 772 |
835 | 835 |
850 | 850 |
853 | 853 |
910 | 910 |
939 | 939 |
966 | 966 |
968 | 968 |
1003 | 1003 |
1006 | 1006 |
1045 | 1045 |
1058 | 1058 |
1154 | 1154 |
1236 | 1236 |
1294 | 1294 |
1395 | 1395 |
1537 | 1537 |
1554 | 1554 |
1587 | 1587 |
1694 | 1694 |
1789 | 1789 |
1832 | 1832 |
1866 | 1866 |
1904 | 1904 |
1920 | 1920 |
1951 | 1951 |
1952 | 1952 |
1953 | 1953 |
1954 | 1954 |
1955 | 1955 |
1981 | 1981 |
1984 | 1984 |
1985 | 1985 |
1986 | 1986 |
Using a Principal Component Analysis (PCA) we can reduce the dimensions of the data and project the “ancestral distances”. In other words, principal component 1 (the first dimension) and principal component 2 (the second dimension) capture most of the variation in the data and represent how similar each sample is to the others. When compared to a reference, you can deduce the ancestral background of each sample in your dataset. Of course, this is relative: we will only know that a given sample is very much alike samples from a given population that exists today.
Nowadays we run such a PCA against a large and diverse dataset containing many different populations. Old-school GWAS (pre-2009) would compare a dataset against HapMap 3; nowadays we prefer at a minimum the 1000G phase 3 populations. In those ancient times the preferred software to run a PCA was Eigensoft, which is a bit tricky to install (see Chapter 20), but nowadays PLINK provides the --pca flag.
For the purpose of this practical primer we will run PCA using PLINK and its --pca-flag against an earlier version of 1000G, phase 1, which is slightly smaller and just as good to use.
We will project our data onto the reference, in this example 1000G phase 1 (1000G), which includes individuals from 14 distinct global populations across 4 ‘super’-populations (Europeans [EUR], Africans [AFR], East Asians [EAS], and admixed Americans [AMR]). In the real world, using phase 1 may be just fine, but if you think your population evolved through extensive migration it’s probably best to use the phase 3 data. In other words, the choice of reference really depends on the dataset.
Next, we will merge our data with 1000G. The alleles at each marker must be aligned to the same DNA strand to allow our data to merge correctly. Because not all SNPs are required for this analysis, the A/T and C/G SNPs, which cannot be aligned unambiguously, can simply be omitted: an A/G SNP read from the other strand becomes T/C, which is detectably different, but an A/T SNP read from the other strand is still A/T, so a strand flip is invisible.
First, we should get a list of relevant variants from our rawdata-dataset. We don’t need the other variants present in the 1000G dataset, right?
cat dummy_project/rawdata.bim | grep "rs" > dummy_project/all.variants.txt
Extract those from the 1000G phase 1 data.
plink --bfile ~/data/shared/ref_1kg_phase1_all/1kg_phase1_all --extract dummy_project/all.variants.txt --make-bed --out dummy_project/1kg_phase1_raw
As explained, the A/T and C/G SNPs are problematic, so we want to exclude these too. Let’s get a list of A/T and C/G variants from 1000G to exclude - this may take a while.
cat dummy_project/1kg_phase1_raw.bim | \
awk '($5 == "A" && $6 == "T") || ($5 == "T" && $6 == "A") || ($5 == "C" && $6 == "G") || ($5 == "G" && $6 == "C")' | awk '{ print $2, $1, $4, $3, $5, $6 }' \
> dummy_project/all.1kg.atcg.variants.txt
Exclude those A/T and C/G variants from both datasets; we will apply the quality filters and restrict to the autosomes when we merge the datasets below.
plink --bfile dummy_project/1kg_phase1_raw --exclude dummy_project/all.1kg.atcg.variants.txt --make-bed --out dummy_project/1kg_phase1_raw_no_atcg
plink --bfile dummy_project/rawdata --exclude dummy_project/all.1kg.atcg.variants.txt --make-bed --out dummy_project/rawdata_1kg_phase1_raw_no_atcg
Try and merge the data.
plink --bfile dummy_project/rawdata_1kg_phase1_raw_no_atcg --bmerge dummy_project/1kg_phase1_raw_no_atcg --make-bed --out dummy_project/rawdata.1kg_phase1
There probably is an error …
Error: 72 variants with 3+ alleles present.
* If you believe this is due to strand inconsistency, try --flip with
dummy_project/rawdata.1kg_phase1-merge.missnp.
(Warning: if the subsequent merge seems to work, strand errors involving SNPs
with A/T or C/G alleles probably remain in your data. If LD between nearby
SNPs is high, --flip-scan should detect them.)
* If you are dealing with genuine multiallelic variants, we recommend exporting
that subset of the data to VCF (via e.g. '--recode vcf'), merging with
another tool/script, and then importing the result; PLINK is not yet suited
to handling them.
See https://www.cog-genomics.org/plink/1.9/data#merge3 for more discussion.
So let’s flip some variants.
plink --bfile dummy_project/rawdata --exclude dummy_project/all.1kg.atcg.variants.txt --flip dummy_project/rawdata.1kg_phase1-merge.missnp --make-bed --out dummy_project/rawdata_1kg_phase1_raw_no_atcg
Let’s try again and merge the data.
plink --bfile dummy_project/rawdata_1kg_phase1_raw_no_atcg --bmerge dummy_project/1kg_phase1_raw_no_atcg --make-bed --out dummy_project/rawdata.1kg_phase1
There still is an error – there are a few multi-allelic variants present which PLINK can’t handle.
Error: 14 variants with 3+ alleles present.
* If you believe this is due to strand inconsistency, try --flip with
dummy_project/rawdata.1kg_phase1-merge.missnp.
(Warning: if the subsequent merge seems to work, strand errors involving SNPs
with A/T or C/G alleles probably remain in your data. If LD between nearby
SNPs is high, --flip-scan should detect them.)
* If you are dealing with genuine multiallelic variants, we recommend exporting
that subset of the data to VCF (via e.g. '--recode vcf'), merging with
another tool/script, and then importing the result; PLINK is not yet suited
to handling them.
See https://www.cog-genomics.org/plink/1.9/data#merge3 for more discussion.
Let’s just remove these multi-allelic variants.
plink --bfile dummy_project/rawdata_1kg_phase1_raw_no_atcg --exclude dummy_project/rawdata.1kg_phase1-merge.missnp --make-bed --out dummy_project/rawdata_1kg_phase1_raw_no_atcg_bi
After removing those pesky multi-allelic variants, we should be able to merge the data. We should take care of the following:

- --extract rawdata/raw-GWA-data.prune.in;
- --autosome;
- --maf 0.10 --geno 0.10 --hwe 1e-3.
plink --bfile dummy_project/rawdata_1kg_phase1_raw_no_atcg_bi \
--bmerge dummy_project/1kg_phase1_raw_no_atcg \
--autosome \
--maf 0.10 --geno 0.10 --hwe 1e-3 \
--extract ~/data/shared/rawdata/raw-GWA-data.prune.in \
--make-bed --out dummy_project/rawdata.1kg_phase1.clean
Before we continue it’s best to clean up a bit of the mess.
rm -fv dummy_project/rawdata_1kg_phase1_raw_no_atcg_bi* dummy_project/1kg_phase1_raw_no_atcg* dummy_project/rawdata.1kg_phase1.pruned* dummy_project/rawdata_1kg_phase1_raw_no_atcg* dummy_project/rawdata.1kg_phase1-merge* dummy_project/rawdata.1kg_phase1.bed dummy_project/rawdata.1kg_phase1.bim dummy_project/rawdata.1kg_phase1.fam dummy_project/rawdata.1kg_phase1.hh dummy_project/rawdata.1kg_phase1.log dummy_project/1kg_phase1_raw.*
Now we have prepared our dataset with only high-quality, common SNPs that have few missing data, with problematic genomic ranges excluded, merged with the 1000G phase 1 reference dataset. Your output should look something like this:
PLINK v1.90b7.2 64-bit (11 Dec 2023) www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to dummy_project/rawdata.1kg_phase1.clean.log.
Options in effect:
--autosome
--bfile dummy_project/rawdata_1kg_phase1_raw_no_atcg_bi
--bmerge ref_1kg_phase1_all/1kg_phase1_raw_no_atcg
--extract rawdata/raw-GWA-data.prune.in
--geno 0.10
--hwe 1e-3
--maf 0.10
--make-bed
--out dummy_project/rawdata.1kg_phase1.clean
16384 MB RAM detected; reserving 8192 MB for main workspace.
2000 people loaded from dummy_project/rawdata_1kg_phase1_raw_no_atcg_bi.fam.
1092 people to be merged from ref_1kg_phase1_all/1kg_phase1_raw_no_atcg.fam.
Of these, 1092 are new, while 0 are present in the base dataset.
Warning: Multiple positions seen for variant 'rs3934834'.
Warning: Multiple positions seen for variant 'rs3737728'.
Warning: Multiple positions seen for variant 'rs6687776'.
Warning: Multiple chromosomes seen for variant 'rs1050301'.
Warning: Multiple chromosomes seen for variant 'rs4850'.
317476 markers loaded from dummy_project/rawdata_1kg_phase1_raw_no_atcg_bi.bim.
312239 markers to be merged from ref_1kg_phase1_all/1kg_phase1_raw_no_atcg.bim.
Of these, 14 are new, while 312225 are present in the base dataset.
312190 more multiple-position warnings: see log file.
Performing single-pass merge (3092 people, 308317 variants).
Merged fileset written to dummy_project/rawdata.1kg_phase1.clean-merge.bed +
dummy_project/rawdata.1kg_phase1.clean-merge.bim +
dummy_project/rawdata.1kg_phase1.clean-merge.fam .
308317 variants loaded from .bim file.
3092 people (1522 males, 1570 females) loaded from .fam.
3092 phenotype values loaded from .fam.
--extract: 49856 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 3078 founders and 14 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.994867.
299 variants removed due to missing genotype data (--geno).
--hwe: 13825 variants removed due to Hardy-Weinberg exact test.
2617 variants removed due to minor allele threshold(s)
(--maf/--max-maf/--mac/--max-mac).
33115 variants and 3092 people pass filters and QC.
Phenotype data is quantitative.
--make-bed to dummy_project/rawdata.1kg_phase1.clean.bed +
dummy_project/rawdata.1kg_phase1.clean.bim +
dummy_project/rawdata.1kg_phase1.clean.fam ... done.
So in total there are 3,092 individuals, 1,522 males and 1,570 females, and 3,078 founders and 14 non-founders. The total genotyping rate is 99.5% and 33,115 variants are present.
Great, we’ve prepared our dummy project data and merged this with 1000G phase 1. Let’s execute the PCA using --pca in PLINK.
plink --bfile dummy_project/rawdata.1kg_phase1.clean --pca --out dummy_project/rawdata.1kg_phase1.clean
If all is peachy, you just successfully ran a PCA against 1000G phase 1. Using --pca we have calculated principal components (PCs), 20 in total by default, and we can now start plotting them. Let’s create a scatter diagram of the first two principal components, including all individuals in the file rawdata.1kg_phase1.clean.eigenvec (the first and second principal components are in columns 3 and 4, respectively). We need to collect some per-sample information to color the points according to sample origin.
First we collect the results from the --pca, the dummy data phenotype information, and the reference population information.
PCA_eigenval <- data.table::fread("dummy_project/rawdata.1kg_phase1.clean.eigenval")
PCA_eigenvec <- data.table::fread("dummy_project/rawdata.1kg_phase1.clean.eigenvec")
ref_pop_raw <- data.table::fread("~/data/shared/ref_1kg_phase1_all/1kg_phase1_all.pheno")
dummy_pop <- data.table::fread("dummy_project/rawdata.fam")
# Rename some
names(PCA_eigenval)[names(PCA_eigenval) == "V1"] <- "Eigenvalue"
# fread names the columns V1..V22; rename them to FID, IID, and PC1..PC20
names(PCA_eigenvec) <- c("FID", "IID", paste0("PC", 1:20))
names(dummy_pop)[names(dummy_pop) == "V1"] <- "Family_ID"
names(dummy_pop)[names(dummy_pop) == "V2"] <- "Individual_ID"
names(dummy_pop)[names(dummy_pop) == "V5"] <- "Gender"
names(dummy_pop)[names(dummy_pop) == "V6"] <- "Phenotype"
dummy_pop$V3<- NULL
dummy_pop$V4<- NULL
dummy_pop$Population <- dummy_pop$Phenotype
dummy_pop$Population[dummy_pop$Population == 2] <- "Case"
dummy_pop$Population[dummy_pop$Population == 1] <- "Control"
# we subset the data we need
ref_pop <- subset(ref_pop_raw, select = c("Family_ID", "Individual_ID", "Gender", "Phenotype", "Population"))
rm(ref_pop_raw)
# we combine the reference and dummy information
ref_dummy_pop <- rbind(dummy_pop, ref_pop)
PCA_1kG <- merge(PCA_eigenvec,
ref_dummy_pop,
by.x = "IID",
by.y = "Individual_ID",
sort = FALSE,
all.x = TRUE)
# Population Description Super population Code Counts
# ASW African Ancestry in Southwest US AFR 4 #49A01D
# CEU Utah residents with Northern and Western European ancestry EUR 7 #E55738
# CHB Han Chinese in Bejing, China EAS 8 #9A3480
# CHS Southern Han Chinese, China EAS 9 #705296
# CLM Colombian in Medellin, Colombia AMR 10 #8D5B9A
# FIN Finnish in Finland EUR 12 #2F8BC9
# GBR British in England and Scotland EUR 13 #1290D9
# IBS Iberian populations in Spain EUR 16 #1396D8
# JPT Japanese in Tokyo, Japan EAS 18 #D5267B
# LWK Luhya in Webuye, Kenya AFR 20 #78B113
# MXL Mexican Ancestry in Los Angeles, California AMR 22 #F59D10
# PUR Puerto Rican in Puerto Rico AMR 25 #FBB820
# TSI Toscani in Italy EUR 27 #4C81BF
# YRI Yoruba in Ibadan, Nigeria AFR 28 #C5D220
PCA_1kGplot <- ggpubr::ggscatter(PCA_1kG,
x = "PC1",
y = "PC2",
color = "Population",
palette = c("#49A01D",
"#595A5C",
"#E55738",
"#9A3480",
"#705296",
"#8D5B9A",
"#A2A3A4",
"#2F8BC9",
"#1290D9",
"#1396D8",
"#D5267B",
"#78B113",
"#F59D10",
"#FBB820",
"#4C81BF",
"#C5D220"),
xlab = "principal component 1", ylab = "principal component 2") +
ggplot2::geom_vline(xintercept = 0.0023, linetype = "dashed",
color = "#E55738", size = 1)
p2 <- ggpubr::ggpar(PCA_1kGplot,
title = "Principal Component Analysis",
subtitle = "Reference population: 1000 G, phase 1",
legend.title = "Populations", legend = "right")
ggplot2::ggsave("dummy_project/gwas-qc-pca-1000g.png", plot = p2)
p2
rm(p2)
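As an aside, the eigenvalues tell you how much variation each component captures. Note that plink’s --pca reports only the top 20 eigenvalues, so the proportions below are relative to the extracted components, not to the full genotype data; a minimal sketch:

# proportion of variance explained, relative to the 20 extracted components
PCA_eigenval$VarExplained <- PCA_eigenval$Eigenvalue / sum(PCA_eigenval$Eigenvalue)
head(PCA_eigenval)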
Derive PC1 and PC2 thresholds so that only individuals who match the given ancestral population are included. For populations of European descent, this will be either the CEU or TSI 1000G individuals (Figure 7.3). Here, we chose to exclude all individuals with a first principal component score less than 0.0023.
Write the FID and IID of these individuals to a file called fail-ancestry-QC.txt.
cat dummy_project/rawdata.1kg_phase1.clean.eigenvec | \
awk '$3 < 0.0023' | awk '{ print $1, $2 }' > dummy_project/fail-ancestry-QC.txt
Choosing which thresholds to apply (and thus which individuals to remove) is not a straightforward process. The key is to remove those individuals with greatly divergent ancestry, as these samples introduce the most bias to the study. Identification of more fine-scale ancestry can be conducted by using less divergent reference samples (e.g., within Europe, stratification could be identified using the CEU, TSI (Italian), GBR (British), FIN (Finnish) and IBS (Iberian) samples from the 1,000 Genomes Project (http://www.1000genomes.org/)). Robust identification of fine-scale population structure often requires the construction of many (2–10) principal components.
knitr::include_graphics("img/gwas_dummy/gwas-qc-pca-1000g.png")
Figure 7.3: PCA - Your data vs. 1000g.
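To hunt for such finer-scale structure yourself, you can plot higher-order components with the same approach; a minimal sketch reusing the merged PCA_1kG table from above:

# third versus fourth principal component, colored by population
ggpubr::ggscatter(PCA_1kG, x = "PC3", y = "PC4",
                  color = "Population",
                  xlab = "principal component 3", ylab = "principal component 4")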
Finally! We have a list of samples of poor quality or divergent ancestry, and duplicated or related samples. We should remove these. Let’s collect all IDs from our fail-*-files into a single file.
cat dummy_project/fail-* | sort -k1 | uniq > dummy_project/fail-qc-inds.txt
This new file should now contain a list of unique individuals failing the previous QC steps which we want to remove.
plink --bfile dummy_project/rawdata --remove dummy_project/fail-qc-inds.txt --make-bed --out dummy_project/clean_inds_data
Question: How many variants and samples are left? How many cases and how many controls did you lose?
Now that you filtered samples, we should turn our attention to step 2 of the QC for GWAS: identify SNPs of poor quality in Chapter 8.
Now that we removed samples, we can focus on low-quality variants.
We start by calculating the missing genotype rate for each SNP, in other words the per-SNP call rate.
plink --bfile dummy_project/clean_inds_data --missing --out dummy_project/clean_inds_data
Let’s visualize the results to identify a threshold for extreme genotype failure rates. We chose a missingness threshold of 3% (i.e., a 97% call rate), but this is arbitrary and depends on the dataset, the data (visualization), and the number of samples (Figure 8.1).
Figure 8.1: Per SNP call rate.
There could also be differences in genotype call rates between cases and controls. It is very important to check for this because these differences could lead to spurious associations. We can test all markers for differences in call rate between cases and controls, or based on other criteria.
plink --bfile dummy_project/clean_inds_data --test-missing --out dummy_project/clean_inds_data
Let’s collect all the SNPs with a significantly different (P < 0.00001) missing data rate between cases and controls.
cat dummy_project/clean_inds_data.missing | awk '$5 < 0.00001' | awk '{ print $2 }' > dummy_project/fail-diffmiss-qc.txt
We should also get an idea of the allele frequencies in our dataset. Low-frequency SNPs should probably be excluded, as they are uninformative when monomorphic (allele frequency = 0), or they may lead to spurious associations.
plink --bfile dummy_project/clean_inds_data --freq --out dummy_project/clean_inds_data
Let’s also plot these data. You can view the result below, and try it yourself (Figure 8.2).
Figure 8.2: Minor allele frequency.
Oh, one more thing about alleles.
PLINK codes alleles as follows:
- A1 = the minor allele, i.e. the least frequent allele;
- A2 = the major allele, i.e. the most frequent allele.
And when you use PLINK, the output of --freq (and the --maf filter) is always relative to the A1 allele, as is the odds ratio (OR) or effect size (beta).
However, SNPTEST makes use of the so-called OXFORD-format, this codes alleles as follows:
- A = the ‘other’ allele;
- B = the ‘coded’ allele.
When you use SNPTEST it will report the allele frequency as CAF, in other words the coded allele frequency, and the effect size (beta) is always relative to the B allele. This means that CAF could be the MAF, or minor allele frequency, but this is not a given.
In other words, always make sure you know the allele coding of a given program, be it PLINK, SNPTEST, GCTA, et cetera! I cannot stress this enough. Ask yourself: ‘what is the allele frequency referring to?’ and ‘the effect size is relative to …?’.
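For example, deriving the minor allele frequency from a coded allele frequency, and re-orienting the effect size to the minor allele, takes only a few lines of R. This is a minimal sketch with hypothetical column names caf and beta:

# hypothetical results with a coded allele frequency (caf) and effect size (beta)
results_oxford <- data.frame(caf = c(0.12, 0.87), beta = c(0.25, -0.10))
results_oxford$maf <- pmin(results_oxford$caf, 1 - results_oxford$caf)
# flip the effect so it is relative to the minor allele where necessary
results_oxford$beta_minor <- ifelse(results_oxford$caf <= 0.5,
                                    results_oxford$beta, -results_oxford$beta)
results_oxford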
Right, let’s continue.
Because we are performing a case-control genome-wide association study, we probably expect some differences in Hardy-Weinberg Equilibrium (HWE), but extreme deviations are probably indicative of genotyping errors.
plink --bfile dummy_project/clean_inds_data --hardy --out dummy_project/clean_inds_data
Let’s also plot these data. You can view the result below, and type over the code to do it yourself.
Figure 8.3: Hardy-Weinberg Equilibrium p-values per stratum.
We are ready to perform the final QC. After inspecting the graphs we will filter on MAF < 0.01, genotype missingness > 0.05 (call rate < 95%), and HWE p < 0.00001 (Figure 8.3); in addition, the SNPs that failed the differential call rate test will be removed.
plink --bfile dummy_project/clean_inds_data --exclude dummy_project/fail-diffmiss-qc.txt --maf 0.01 --geno 0.05 --hwe 0.00001 --make-bed --out dummy_project/cleandata
Congratulations. You reached a very important milestone. Now that you filtered samples and SNPs, we can finally start the association analyses in Chapter 9.
Now that you have learned how to perform QC, you can easily run a GWAS and execute some downstream visualization and analyses. Let’s do this with a dummy dataset.
Even when someone says that the QC was done, it is still wise and good practice to run some of the commands above to get a ‘feeling’ for the data. So let’s do this.
First, we’ll clean up what we did before - we don’t need it anyways.
clear
rm -v dummy_project/*
With the command clear you create a blank Terminal screen. With rm -v you remove all files in the dummy_project directory; the -v flag means that you get verbose output of what is removed.
plink --bfile ~/data/shared/gwas/gwa --freq --out dummy_project/gwa
plink --bfile ~/data/shared/gwas/gwa --missing --out dummy_project/gwa
plink --bfile ~/data/shared/gwas/gwa --hardy --out dummy_project/gwa
Let’s visualize the results. First we should load in all the results.
Question: Load the data using R. [Hint: use and adapt the examples from the previous chapters.]
We can plot the per-stratum HWE p-values.
Question: Plot the per-stratum HWE p-values using R. [Hint: use and adapt the examples from the previous chapters.]
Figure 9.1: Per stratum HWE p-values.
We will want to see what the distribution of allele frequencies looks like.
Question: Plot the allele frequencies using R. [Hint: use and adapt the examples from the previous chapters.]
Figure 9.2: Minor allele frequencies.
We will want to identify samples that have poor call rates.
Question: Plot the per-sample call rates using R. [Hint: use and adapt the examples from the previous chapters.]
Figure 9.3: Per sample call rates.
We also need to know what the per SNP call rates are.
Question: Plot the per-SNP call rates using R. [Hint: use and adapt the examples from the previous chapters.]
Figure 9.4: Per SNP call rates.
A simple chi-square test of association can be done.
plink --bfile ~/data/shared/gwas/gwa --model --out dummy_project/data
Genotypic, dominant and recessive tests will not be conducted if any one of the cells in the table of case-control by genotype counts contains fewer than five observations. This is because the chi-square approximation may not be reliable when cell counts are small. For SNPs with a MAF < 5%, a sample of more than 2,000 cases and controls would be required to meet this threshold, and more than 50,000 for SNPs with a MAF < 1%.
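You can verify these numbers with a quick back-of-the-envelope check: under Hardy-Weinberg equilibrium the expected number of homozygotes for the minor allele is N x q^2, and we need at least five of them.

# minimum N for >= 5 expected minor-allele homozygotes (N * q^2 >= 5)
q <- c(0.05, 0.01)
ceiling(5 / q^2) # 2000 and 50000, matching the numbers quoted above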
You can change this default behaviour by adding the flag --cell, e.g., we could lower the threshold to 3.
plink --bfile ~/data/shared/gwas/gwa --model --cell 3 --out dummy_project/data
Let’s review the contents of the results.
It contains 1,530,510 rows, one for each SNP and each type of test (genotypic, trend, allelic, dominant, and recessive), with the following columns:

- CHR, chromosome;
- SNP, SNP identifier;
- A1, the tested allele (remember, PLINK always codes the A1 allele as the minor allele!);
- A2, the other allele;
- TEST: GENO (genotypic association), TREND (Cochran-Armitage trend), ALLELIC (allelic association), DOM (dominant model), or REC (recessive model);
- AFF, the genotype/allele counts in cases;
- UNAFF, the genotype/allele counts in controls;
- CHISQ, the chi-square statistic;
- DF, the degrees of freedom; and
- P, the p-value of association.

Question: Do you know which model, i.e. which TEST, is most commonly used and reported? And why is that, do you think?
We can also perform a test of association using logistic regression. In this case we might want to correct for covariates/confounding factors, for example age, sex, ancestral background (i.e., principal components), and other study-specific covariates (e.g., hospital of inclusion, genotyping centre, et cetera). In that case each of the p-values is adjusted for the effect of the covariates.
When running a regression analysis, be it linear or logistic, PLINK assumes a multiplicative model. By default, when at least one male and one female is present, sex (male = 1, female = 0) is automatically added as a covariate for X-chromosome SNPs, and nowhere else. The sex modifier causes it to be added everywhere, while no-x-sex excludes it.
plink --bfile ~/data/shared/gwas/gwa --logistic sex --covar ~/data/shared/gwas/gwa.covar --out dummy_project/data
Let’s examine the results.
## [1] 918306 9
## CHR SNP BP A1 TEST NMISS OR STAT P
## <int> <char> <int> <char> <char> <int> <num> <num> <num>
## 1: 1 rs3934834 995669 T ADD 3818 1.0290 0.38120 0.7031
## 2: 1 rs3934834 995669 T AGE 3818 1.0020 1.11800 0.2635
## 3: 1 rs3934834 995669 T SEX 3818 1.0120 0.19090 0.8486
## 4: 1 rs3737728 1011278 A ADD 3982 1.0190 0.38670 0.6990
## 5: 1 rs3737728 1011278 A AGE 3982 1.0020 1.09800 0.2721
## 6: 1 rs3737728 1011278 A SEX 3982 1.0060 0.09898 0.9212
## 7: 1 rs6687776 1020428 T ADD 3915 0.9692 -0.33330 0.7389
## 8: 1 rs6687776 1020428 T AGE 3915 1.0020 1.04000 0.2984
## 9: 1 rs6687776 1020428 T SEX 3915 1.0150 0.23690 0.8127
Question: How come there are more lines in this file than there are variants?
If no model option is specified, the first row for each SNP corresponds to the results of a multiplicative test of association. The C >= 0 subsequent rows for each SNP correspond to separate tests of significance for each of the C covariates included in the regression model. We can remove the covariate-specific lines from the main report by adding the hide-covar modifier, as shown below.
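For example (note the hypothetical --out name, chosen so you don’t overwrite the results from above):

plink --bfile ~/data/shared/gwas/gwa --logistic sex hide-covar --covar ~/data/shared/gwas/gwa.covar --out dummy_project/data_hidecovar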
The columns in the association results are:

- CHR, chromosome;
- SNP, SNP identifier;
- BP, base-pair position;
- A1, the tested allele;
- TEST: ADD (multiplicative model, or genotypic model testing additivity), GENO_2DF (genotypic model), DOMDEV (genotypic model testing deviation from additivity), DOM (dominant model), or REC (recessive model);
- NMISS, the number of non-missing observations;
- OR, the odds ratio relative to the A1, i.e. minor, allele;
- STAT, the test statistic (z-statistic); and
- P, the p-value of association.

We need to calculate the standard error and confidence interval from the z-statistic. We can change the effect-size output from OR to beta by adding the beta modifier.
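A minimal sketch of how you could do this in R, assuming the logistic results from above should have been written to dummy_project/data.assoc.logistic (the default extension plink uses for --logistic output):

results_logistic <- data.table::fread("dummy_project/data.assoc.logistic")
add <- results_logistic[results_logistic$TEST == "ADD", ] # multiplicative rows only
add$beta <- log(add$OR)       # beta is the natural logarithm of the OR
add$se <- add$beta / add$STAT # z = beta/se, hence se = beta/z
add$or_lower <- exp(add$beta - 1.96 * add$se)
add$or_upper <- exp(add$beta + 1.96 * add$se)
head(add)

Alternatively, adding --ci 0.95 to the plink call makes plink write the standard error and confidence bounds directly.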
Question: Can you write down the mathematical relation between beta and OR?
Looking at numbers is important, but it won’t give you a perfect overview. We should turn to visualizing our results in Chapter 10.
Data visualization is key, not only for presentation but also to inspect the results.
We should create quantile-quantile (QQ) plots to compare the observed association test statistics with their expected values under the null hypothesis of no association and so assess the number, magnitude and quality of true associations.
First, we will add the standard error, call rate, A2, and allele frequencies.
Let’s list the number of SNPs per chromosome. This gives a pretty good idea of the per-chromosome coverage, and it’s a sanity check: did the whole analysis run properly (we expect 22 chromosomes)?
| Chr | Freq |
|---|---|
| 1 | 23,173 |
| 2 | 25,206 |
| 3 | 21,402 |
| 4 | 19,008 |
| 5 | 19,157 |
| 6 | 20,672 |
| 7 | 16,581 |
| 8 | 18,089 |
| 9 | 15,709 |
| 10 | 15,536 |
| 11 | 14,564 |
| 12 | 14,889 |
| 13 | 11,524 |
| 14 | 9,822 |
| 15 | 8,838 |
| 16 | 8,920 |
| 17 | 8,262 |
| 18 | 10,356 |
| 19 | 5,820 |
| 20 | 7,792 |
| 21 | 5,412 |
| 22 | 5,370 |
Question: Try to figure out how to get the number of variants per chromosome. Why does the number of variants per chromosome (approximately) correlate with the chromosome number?
Question: Where are the data for chromosome X, Y and MT?
Let’s plot the QQ plot to diagnose our GWAS.
Since we’re working in Utrecht on the Utrecht Science Park, let’s get the color scheme in R.
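The plotting chunk itself is not rendered here; a minimal sketch using the qq() function from the qqman package (an assumption - the book's own chunk uses a custom theme and the Utrecht colours):

```r
library("qqman")

# Observed vs. expected -log10(P) under the null hypothesis of no association.
qqman::qq(gwas_assoc_compl$P, main = "QQ plot of the dummy GWAS")
```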
Figure 10.1: A QQ plot.
We also need to create a Manhattan plot to display the association test p-values as a function of chromosomal location; this provides a visual summary of the association results that draws immediate attention to any regions of significance (Figure 10.2).
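A minimal sketch, again with the qqman package (an assumption; the book's own chunk is not rendered here):

```r
library("qqman")

# Manhattan plot; horizontal lines mark the suggestive (1e-5) and
# genome-wide significant (5e-8) thresholds.
qqman::manhattan(gwas_assoc_compl,
                 chr = "CHR", bp = "BP", snp = "SNP", p = "P",
                 suggestiveline = -log10(1e-5),
                 genomewideline = -log10(5e-8))
```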
Figure 10.2: A Manhattan plot.
It is also informative to plot the SNP density per chromosome. We can use the CMplot package for that, which you can find here. For now we just make these graphs 'quick-n-dirty'; you can prettify them further, but you'll easily lose track of time, so maybe carry on.
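A minimal sketch, assuming the merged results from before (the plot.type and bin.size arguments are CMplot's own):

```r
library("CMplot")

# CMplot expects SNP, chromosome, and position first, then the p-values.
cmplot_input <- gwas_assoc_compl[, .(SNP, CHR, BP, P)]

# "d" = SNP density plot, binned per 1 Mb window.
CMplot(cmplot_input, plot.type = "d", bin.size = 1e6,
       file = "jpg", dpi = 300, file.output = TRUE, verbose = TRUE)
```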
This would lead to the following graphs.
Question: What do the grey spots on the density plot indicate?
Figure 10.3: SNP density of the association results.
Figure 10.4: A QQ plot including a 95% confidence interval (blue area) and genome-wide significant hits (red).
Figure 10.5: A regular Manhattan plot, colored by chromosome; suggestive hits are green, genome-wide significant hits are red. The bottom graph shows the per-chromosome SNP density.
Figure 10.6: A circular Manhattan plot.
You can also make an interactive version of the Manhattan - just because you can. The code below shows you how.
library(plotly)
library(dplyr)
library(qqman) # provides the example data: gwasResults and snpsOfInterest
# Prepare the dataset (as an example we use the data (gwasResults) from the `qqman`-package)
don <- gwasResults %>%
  # Compute chromosome size
  group_by(CHR) %>%
  summarise(chr_len = max(BP)) %>%
  # Calculate cumulative position of each chromosome
  mutate(tot = cumsum(chr_len) - chr_len) %>%
  select(-chr_len) %>%
  # Add this info to the initial dataset
  left_join(gwasResults, ., by = c("CHR" = "CHR")) %>%
  # Add a cumulative position of each SNP
  arrange(CHR, BP) %>%
  mutate(BPcum = BP + tot) %>%
  # Add highlight and annotation information
  mutate(is_highlight = ifelse(SNP %in% snpsOfInterest, "yes", "no")) %>%
  # Filter SNPs to make the plot lighter
  filter(-log10(P) > 0.5)
# Prepare the X axis
axisdf <- don %>%
  group_by(CHR) %>%
  summarize(center = (max(BPcum) + min(BPcum)) / 2)
# Prepare the text description for each SNP
don$text <- paste("SNP: ", don$SNP, "\nPosition: ", don$BP,
                  "\nChromosome: ", don$CHR,
                  "\nLOD score:", -log10(don$P) %>% round(2),
                  "\nWhat else do you wanna know", sep = "")
# Make the plot
p <- ggplot(don, aes(x = BPcum, y = -log10(P), text = text)) +
  # Show all points
  geom_point(aes(color = as.factor(CHR)), alpha = 0.8, size = 1.3) +
  scale_color_manual(values = rep(c("grey", "skyblue"), 22)) +
  # Custom X axis
  scale_x_continuous(label = axisdf$CHR, breaks = axisdf$center) +
  # Remove space between plot area and x axis
  scale_y_continuous(expand = c(0, 0)) +
  # Add highlighted points
  geom_point(data = subset(don, is_highlight == "yes"), color = "orange", size = 2) +
  # Customize the theme
  theme_bw() +
  theme(
    legend.position = "none",
    panel.border = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank()
  )
ggplotly(p, tooltip = "text")
It will produce something like the plot below. In the CoCalc Notebook window it isn't smooth, but this way you have an idea of how to do it in a real-case scenario.
Again, this is an example with dummy data - you can try it with our GWAS, but be careful with the time. You can also choose to carry on.
You will encounter the above types of visualizations in any high-quality GWAS paper, because each is so critically informative. Usually, analysts of large-scale meta-analyses of GWAS will also stratify the QQ plots by imputation quality (if your GWAS was imputed), call rate, and allele frequency (although those stratified plots are rarely shared in publications, not even in supplemental material).
Alright. It’s time to stop playing around and do a quick recap of what you’ve learned.
You should be ready for the real stuff. And if not, the next chapter will help you get ready: Chapter @ref(wtccc1_intro).
Now that you know your way around PLINK, bash, and R, and have done some basic quality control and association testing, you are ready for the real thing. We have prepared a real dataset: the first release of the Wellcome Trust Case-Control Consortium (WTCCC) on coronary artery disease (CAD), plus a control dataset used for that project.
The WTCCC1 data were genotyped using a chip from Affymetrix, nowadays part of ThermoFisher. As a brand Affymetrix still exists, but the chips are no longer made. Unfortunately, most links to the old-generation Affymetrix chips are broken, but you can still find some information about the 500K chip that was used for WTCCC1. It's good practice to read up a bit on which chip was used and what support materials are available.
Before quality control the original data included:
Your assignment in the next chapter is to do the following:
wtccc1.

As I wrote, you are ready for the real stuff in Chapter 12.
As usual, we start by exploring the data in hand.
First, let’s make a clean slate and create a working directory for the WTCCC1 data.
clear
mkdir -v ~/wtccc1
Next, we'll run a few PLINK commands.
plink --bfile ~/data/shared/wtccc1/CADn1871_500Kb37fwd --bmerge ~/data/shared/wtccc1/UKBSn1397_500Kb37fwd --make-bed --out wtccc1/wtccc1 && \
plink --bfile wtccc1/wtccc1 --freq --out wtccc1/wtccc1 && \
plink --bfile wtccc1/wtccc1 --hardy --out wtccc1/wtccc1 && \
plink --bfile wtccc1/wtccc1 --missing --out wtccc1/wtccc1 && \
plink --bfile wtccc1/wtccc1 --test-missing --out wtccc1/wtccc1
cat wtccc1/wtccc1.missing | awk '$5 < 0.00001' | awk '{ print $2 }' > wtccc1/wtccc1-fail-diffmiss-qc.txt
Let’s investigate the HWE p-value in the whole cohort, and per stratum (cases and controls) with the code below.
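That chunk follows the same recipe as the dummy-data answers in the appendix; a minimal sketch, assuming the --hardy output from the commands above landed in wtccc1/wtccc1.hwe:

```r
library("data.table")
library("ggpubr")

# Load the Hardy-Weinberg test results.
wtccc1_HWE <- data.table::fread("wtccc1/wtccc1.hwe")
wtccc1_HWE$logP <- -log10(wtccc1_HWE$P)

# The TEST column distinguishes ALL, AFF (cases), and UNAFF (controls).
ggpubr::gghistogram(wtccc1_HWE, x = "logP",
                    color = "TEST", fill = "TEST",
                    facet.by = "TEST", bins = 50,
                    xlab = "HWE -log10(P)")
```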
This will result in Figure 12.1.
Figure 12.1: Stratified HWE p-values.
We should also inspect the allele frequencies. Note that by default PLINK v1.x (v1.07 and v1.9) stores the alleles as minor (A1) and major (A2), and --freq therefore reports the frequency of the minor allele (A1).
This will result in Figure 12.2.
Figure 12.2: Minor allele frequencies.
There could be samples with a very poor overall call rate, where for many SNPs there is no data. We will want to identify these samples and exclude them.
This will result in Figure 12.3.
Figure 12.3: Per sample call rate.
Question: What do you notice in the ‘per sample call rate’ graph? Can you think of a reason why this is? And how would you deal with this?
Lastly, we must inspect the per SNP call rate; we need to know if there are SNPs that have no data for many samples. We will want to identify such SNPs and exclude these.
This will result in Figure 12.4.
Figure 12.4: Per SNP call rate.
Now that we have a handle on the data, we can filter it.
Question: Do you have any thoughts on that? Do you agree with the filters I set below? How would you do it differently and why?
plink --bfile wtccc1/wtccc1 --exclude wtccc1/wtccc1-fail-diffmiss-qc.txt --maf 0.01 --geno 0.05 --hwe 0.00001 --make-bed --out wtccc1/wtccc1_clean
Even if these individuals are all from the United Kingdom, we can be certain there will be admixture from other populations, given the UK's history. Let's project the WTCCC1 data onto the 1000G phase 1 populations.
We will face the same issue as before with our dummy dataset with respect to EIGENSOFT. So I created the data for you, which means you can skip ahead to the Plotting PCA section immediately. Regardless, in the Preparing PCA and Running PCA sections I show you how to get there.
Filtering WTCCC1
For PCA we need to perform an extreme clean of the data.
plink --bfile wtccc1/wtccc1_clean --maf 0.1 --geno 0.1 --indep-pairwise 100 50 0.2 --exclude ~/data/shared/support/exclude_problematic_range.txt --make-bed --out wtccc1/wtccc1_temp
plink --bfile wtccc1/wtccc1_temp --exclude wtccc1/wtccc1_temp.prune.out --make-bed --out wtccc1/wtccc1_extrclean
rm -fv wtccc1/wtccc1_temp*
cat wtccc1/wtccc1_extrclean.bim | awk '{ print $2 }' > wtccc1/wtccc1_extrclean.variants.txt
cat wtccc1/wtccc1.bim | grep "rs" | awk '{ print $2 }' > wtccc1/all.variants.txt
Notice that you are using real-world data: thousands of variants are 'pruned' due to the --indep-pairwise 100 50 0.2 flag.
Merging WTCCC1 with 1000G phase 1
Now we are ready to extract the WTCCC1 variants from the 1000G phase 1 reference.
plink --bfile ~/data/shared/ref_1kg_phase1_all/1kg_phase1_all --extract wtccc1/all.variants.txt --make-bed --out wtccc1/1kg_phase1_wtccc1
We also identify the ambiguous A/T and C/G SNPs, so that we can exclude them.
cat wtccc1/1kg_phase1_wtccc1.bim | \
awk '($5 == "A" && $6 == "T") || ($5 == "T" && $6 == "A") || ($5 == "C" && $6 == "G") || ($5 == "G" && $6 == "C")' | awk '{ print $2, $1, $4, $3, $5, $6 }' \
> wtccc1/all.1kg_wtccc1.atcg.variants.txt
plink --bfile wtccc1/1kg_phase1_wtccc1 --exclude wtccc1/all.1kg_wtccc1.atcg.variants.txt --make-bed --out wtccc1/1kg_phase1_wtccc1_no_atcg
plink --bfile wtccc1/1kg_phase1_wtccc1_no_atcg --extract wtccc1/wtccc1_extrclean.variants.txt --make-bed --out wtccc1/1kg_phase1_raw_no_atcg_wtccc1
Finally we will merge the datasets.
plink --bfile wtccc1/wtccc1_extrclean --bmerge wtccc1/1kg_phase1_raw_no_atcg_wtccc1 --maf 0.1 --geno 0.1 --exclude ~/data/shared/support/exclude_problematic_range.txt --make-bed --out wtccc1/wtccc1_extrclean_1kg
Great, we've prepared our WTCCC1 data and merged it with 1000G phase 1. Let's execute the PCA using --pca in PLINK.
plink --bfile wtccc1/wtccc1_extrclean_1kg --pca --out wtccc1/wtccc1_extrclean_1kg
If all is peachy, you were able to run the PCA for the WTCCC1 data against 1000G phase 1. Using --pca in PLINK we have calculated principal components (PCs) and can now start plotting them. Let's create a scatter diagram of the first two principal components, just like we did with the dummy data.
And we should visualize the PCA results: are these individuals really all from European (UK) ancestry?
We expect most individuals from the WTCCC to be 100% British, but a substantial group will have a different ancestral background as shown in the Figure 12.5 you just made.
Figure 12.5: PCA - WTCCC1 vs. 1000G
In a similar fashion as in the example gwas and rawdata datasets, you should consider to remove the samples below the threshold based on this PCA (Figure 12.5).
Go ahead, try that.
Your code would be something like this:
cat wtccc1/wtccc1_extrclean_1kg.eigenvec | \
awk '$3 < 0.0023' | awk '{ print $1, $2 }' > wtccc1/fail-ancestry-QC.txt
Next we filter these samples and get a final fully QC’d dataset.
plink --bfile wtccc1/wtccc1_clean --remove wtccc1/fail-ancestry-QC.txt --make-bed --out wtccc1/wtccc1_qc
Now that we have explored the data, we are ready for some simple association testing. However, it would be great to have some PCs to correct for. We can use PLINK for that too.
plink --bfile wtccc1/wtccc1_extrclean --remove wtccc1/fail-ancestry-QC.txt --pca --out wtccc1/wtccc1_extrclean
Let’s add those PCs to the covariates-file.
echo "IID PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16 PC17 PC18 PC19 PC20" > wtccc1/wtccc1_qc.pca
cat wtccc1/wtccc1_extrclean.eigenvec | awk '{ print $2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22}' >> wtccc1/wtccc1_qc.pca
perl ~/data/shared/scripts/mergeTables.pl --file1 wtccc1/wtccc1_qc.pca --file2 ~/data/shared/wtccc1/wtccc1.covar --index IID --format NORM > wtccc1/wtccc1_qc.covar_pca
Exciting, now we are ready to perform a GWAS on CAD in WTCCC1. This may take some time (when I tested it, it took at least 5 minutes), so it’s a perfect moment to take a break, stretch your legs and get a coffee.
plink --bfile wtccc1/wtccc1_qc --logistic sex --covar wtccc1/wtccc1_qc.covar_pca --out wtccc1/wtccc1_qc_log_covar_pca
After you've run the association analysis, you're ready to process the data and take a first look at the results. First, we prepare the raw output.
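A minimal sketch of that preparation (the object name wtccc1_assoc is mine; the book's fully annotated object is called wtccc1_assoc_compl later on):

```r
library("data.table")

# Load the WTCCC1 logistic regression results and keep the ADD rows.
wtccc1_assoc <- data.table::fread("wtccc1/wtccc1_qc_log_covar_pca.assoc.logistic")
wtccc1_assoc <- wtccc1_assoc[TEST == "ADD"]

# Derive beta and its standard error from the odds ratio and z-statistic.
wtccc1_assoc[, BETA := log(OR)]
wtccc1_assoc[, SE := BETA / STAT]
```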
You could visualize these results with the code below.
This would lead to the following graphs.
Figure 12.6: SNP density of the association results.
Figure 12.7: A QQ plot including a 95% confidence interval (blue area) and genome-wide significant hits (red).
Figure 12.8: A regular Manhattan plot, colored by chromosome; suggestive hits are green, genome-wide significant hits are red. The bottom graph shows the per-chromosome SNP density.
Figure 12.9: A circular Manhattan plot.
You reached an important milestone.
You recreated the work of the whole WTCCC1 team, which took them years, in just one afternoon!
Wow. Take a pause. And realize what you’ve done.
Back then there wasn't much in the way of analyses after a GWAS; nowadays there are many post-GWAS analysis methods. We will cover them in the next Chapter @ref(post_gwas).
A critical step in post-GWAS analysis is probably ‘mapping SNPs to genes’. It is critical, but it is also the most challenging. How do you even map SNPs to genes? What criteria to use? Should we take into account physical position? Or is it of interest that certain SNPs might influence downstream or upstream gene expression? And what of the fact that most loci discovered in GWAS are _inter_genic? What is the heritability of our trait? Are there any pleiotropic effects?
In the next few sections I deal with a couple of the downstream, post-GWAS analyses. We don’t have time to go practically into a few of these steps, but I do believe it is important to provide you with some pointers where to start.
The Manhattan plot immediately draws your attention to the peaks above the genome-wide significance threshold. Clumping is the procedure in which we identify the independent hits: the top variants, plus the variants in linkage disequilibrium with them (their 'LD buddies') in the same genomic region (locus). The basic steps are as follows. First, define a threshold above which to identify the top variant; usually this is the genome-wide significance threshold. Next, define the maximum p-value of association a variant may have with the trait of interest to count as an LD buddy. Then, define the strength of the correlation that is allowed between the top variant and its LD buddies. And lastly, define the size of the region around the top variant to assess.
Clumping can easily be done in PLINK, as the sketch below shows. You can also try it with the dummy GWAS dataset in the additional chapter @ref(regional_plot_cocalc). You don't have to do it now; you can choose to do it when you have time.
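The four settings above map directly onto PLINK's clumping flags: --clump-p1 (threshold for index variants), --clump-p2 (maximum p-value for LD buddies), --clump-r2 (allowed LD with the index variant), and --clump-kb (window around the index variant). A minimal sketch on the dummy GWAS results:

```bash
plink --bfile ~/data/shared/gwas/gwa \
      --clump dummy_project/data.assoc.logistic \
      --clump-p1 5e-8 --clump-p2 0.05 \
      --clump-r2 0.05 --clump-kb 500 \
      --out dummy_project/data.assoc.logistic
```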
The visualization of a GWAS is very appealing and draws attention to single peaks, and the most significant variant. And with clumping you can identify the independent signals, i.e. the top variants and their LD buddies in the locus. We implicitly assume that the top variant captures all the variation in the region by its linkage disequilibrium with an unknown causal variant, in other words it is ‘tagging’ the (un)measured potential causal variant in the region. Intuitively it is unlikely that a single causal variant is accounting for all the LD between the unknown causal variant and the measured (genotyped or imputed) variants at the locus. So, a naive focus on only the top variant ignores the fact that multiple causal variants exist and thus the variation captured by that single top variant underestimates the total variation that could be explained by that single region (locus).
Conditional analysis is a tool to identify secondary variants associated with the trait of interest. This involves conditioning the association model (trait ~ variant X + covariates) on the primary associated variant in that locus. A more comprehensive and general approach is to condition on variants across the whole genome, starting with the top variants and stepwise selecting additional variants according to their conditional p-values. This strategy can result in multiple top variants in the same locus; in other words, it can uncover different haplotypes causing the same association with the trait of interest.
This would be straightforward with supercomputers and full access to individual-level data. Unfortunately, we often have neither.
Don’t despair, a method exists that fills this gap. With the program GCTA you can execute conditional analyses on GWAS summary statistics - the kind you just produced with the dummy data and with the WTCCC1 dataset - and a proper reference. You can read more on this in the paper.
For now, we will skip this part. I will add a whole chapter on this at some later stage.
Statistical finemapping is closely related to conditional analysis. Where conditional analysis identifies secondary signals in a region associated to the trait of interest, statistical finemapping answers the question ‘which variants are likely causal to the trait?’. In more formal words, through statistical finemapping we identify the 95% credible set of causal variants. There are multiple methods and tools developed to get to this answer. I list a few and I encourage you to read up on these.
This tool is very fast and versatile, and was developed by Christian Benner. It identifies the credible sets for each locus and calculates the posterior probabilities for each set. This is the more hands-on version of getting to credible sets.
You could upload your data - privately or publicly - to my.locuszoom.org and obtain a list of causal variants with posterior probabilities. It uses a simple procedure to obtain the credible sets using these scripts. This is the lazier version of getting a list of likely causal variants.
An alternative approach to credible set identification is SuSiE. With SuSiE you can identify credible sets under the assumption that multiple causal variants may be present in a locus, whereas the simple procedure above assumes a single causal variant.
For now, we will skip this part. I will add a whole chapter on this at some later stage.
Researchers from the VUMC in Amsterdam have created an online tool that aids in mapping genes and function to GWAS: “Functional Mapping and Annotation of Genome-Wide Association Studies” a.k.a. FUMA. This online tool uses a variety of datasets and programs to prioritize genes and map these to associated loci.
We have covered some aspects of post-GWAS analyses, and a lot are covered by FUMA. Let’s try and annotate our WTCCC1 results. The assignment in this chapter 14 will be a bit more Do It Yourself.
A phenome-wide association study, or PheWAS, assesses the association of top variants identified in a GWAS with other traits. Through a PheWAS you'll get a notion of whether any of the top variants, the independent hits in your GWAS, have pleiotropic effects; in other words, whether your independent hits have effects on other traits too. This paints a picture, as it were, of the role the genetic locus you identified plays in life: is it unique to your trait or does it affect other traits as well? What does it mean when your top variant near the gene FTO associates not only with BMI, but also with type 2 diabetes and coronary artery disease? A PheWAS will not answer that question, but it does help to inventory all the other traits with which your hits are associated.
Again, we’ll get a bit hands-on and more Do It Yourself in the chapter that deals with this. Let’s jump over to Chapter 15.
From all of the above follows another intuitive question. Suppose a signal for your trait also shows association - to some extent - with other traits. Do these two signals then significantly overlap, more so than you would expect based on chance? Colocalization deals with exactly this question.
We are not going to deal with colocalization, but a nice starting point is RACER about which you can read more in chapter @ref(regional_plot_cocalc). I will add a whole chapter on this at some later stage.
Because of the underlying biology, or because of the way they are measured or calculated, traits can be highly correlated - phenotypically. That is to say, when levels of HDL are high, LDL is probably low, and so when you draw a correlation plot from data obtained in a general population you'll see a nice pattern. Biology may also cause traits to be genetically correlated: the same genetic variants influence multiple traits. In other words, if a variant is associated with higher levels of LDL cholesterol, it may also be associated with lower HDL cholesterol, et cetera. Genetic correlation aids in further understanding how traits are related biologically: a strong genetic correlation may be due to the same biological pathways, so that the traits are 'linked' through the same (possibly counteracting) processes.
Puzzle: what other phenomenon could cause a high but spurious genetic correlation?
We are not going to deal with genetic correlation. I will add a whole chapter on this at some later stage.
In observational studies we may find a strong association between a certain risk factor or biomarker and a disease. For instance, epidemiological studies show that high circulating cystatin C is associated with risk of cardiovascular disease (CVD), independent of creatinine-based renal function measurements (Laan S. W. 2016). However, residual confounding and reverse causality remain alternative explanations for the strong correlation between cystatin C and CVD, and both are difficult to tease apart in traditional observational (epidemiological) studies.
Mendelian Randomization (MR) harnesses the properties of the genome to enable causal inference for a biomarker. Specifically, the invariant nature of the genome and the random distribution of alleles from parents to offspring at conception mean that genetic information is not influenced by disease status (reverse causality) and should be free from confounding by traditional risk factors (Laan S. W. 2016). So, genetic variation that modulates serum concentrations of cystatin C could serve as an instrumental variable to assess the effect of lifelong elevated concentrations of cystatin C on disease risk, independent of potential confounders (Laan S. W. 2016).
Finding the causal genes for complex disease is a primary objective of many researchers, because these genes are putative therapeutic targets. In this course we intend to find out whether Type 2 Diabetes (T2D) causes coronary artery disease and ischemic stroke.
There is an easy, quick-and-dirty way and a harder way.
The quick-and-dirty way uses MRBase. It is based on the TwoSampleMR package and uses the MRInstruments package to load in all the genetic instruments. The creators of MRBase have curated hundreds of GWAS and molecular QTL studies, and prepared the data for easy use on the website and in the R packages.
The hard way means getting your hands dirty and using the TwoSampleMR package yourself in R. You’ll learn all the steps you need to take into account to get to the answer of this question. You’ll create the diagnostic graphs and calculate the statistics.
In both cases you'll learn how to execute an MR, what to look out for in the statistics and diagnostic graphs, and how to interpret the results. Your choice: Chapter @ref(mr_mrbase) or Chapter @ref(mr_twosamplemr).
In a future version I aim to include a partial replication of the cystatin C paper (Laan S. W. 2016).
Under the assumption of polygenicity, many variants with small effects contribute to the phenotypic variation of a trait or to the risk of disease. Sample size and accurate phenotyping have a major impact on the power of a GWAS. It is estimated that 70% of variants that reach P < 10^-6^ in an initial discovery GWAS will reach genome-wide significance as sample sizes increase. The individual variants that do reach genome-wide significance the first or second time around do not explain all the phenotypic variation or residual disease risk.
To capture the polygenicity of a trait beyond the individual genome-wide significant hits, we can calculate a polygenic score (PGS). There are several methods to calculate a PGS, but essentially it comes down to multiplying the effect size of a given variant by the number of effect alleles a given individual carries, and summing this over variants. Depending on the method you can use the 'hard-called' genotype data or the imputed data. In the latter case the effect sizes are multiplied by the genotype probabilities or dosages; this ensures that the PGS takes into account the uncertainty intrinsic to imputed data.
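In formula form (a notational sketch of what was just described), the PGS of individual $i$ over $M$ included variants is

$$\text{PGS}_i = \sum_{j=1}^{M} \hat{\beta}_j \, x_{ij},$$

where $\hat{\beta}_j$ is the estimated effect size of variant $j$ from the discovery GWAS, and $x_{ij} \in \{0, 1, 2\}$ is the number of effect alleles carried by individual $i$ (or the expected dosage in $[0, 2]$ when using imputed data).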
Let's consider the most intuitive PGS method, which is based on p-value thresholding and was nicely applied in this classic paper. This method selects variants based on their p-value and then calculates the PGS. This can be done using increasingly liberal significance thresholds (pT), for instance starting with all variants with P < 10^-8^, next P < 10^-7^, P < 10^-6^, P < 10^-5^, P < 10^-4^, P < 10^-3^, P < 0.05, P < 0.1, P < 0.2, P < 0.3, P < 0.4, P < 0.5. Under the assumption of polygenicity we expect an increasingly stronger correlation (r2) between the PGS at a certain pT and the trait of interest. Another application of this method is provided by Van Setten (Setten J. 2015).
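In PLINK this p-value thresholding approach can be sketched with --score combined with --q-score-range; the file names below are assumptions, and the score file would hold the variant IDs, effect alleles, and betas from the discovery GWAS:

```bash
# ranges.txt defines the labelled pT bins (label, lower bound, upper bound), e.g.:
#   pT_5e-08 0 5e-08
#   pT_1e-05 0 1e-05
#   pT_0.05  0 0.05
#   pT_0.5   0 0.5
# score.txt: variant ID (column 1), effect allele (column 2), beta (column 3)
# pvalues.txt: variant ID and the GWAS p-value
plink --bfile target_data \
      --score score.txt 1 2 3 header sum \
      --q-score-range ranges.txt pvalues.txt header \
      --out pgs_thresholded
# This writes one pgs_thresholded.<label>.profile file per pT bin.
```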
Lastly, there is a great explanatory website on how to calculate PGS, and it also shows one application of PGS (although much debate ensues about this particular application!).
You have come to the end of this practical primer. What is left is a summary (not written yet) of what you've learned and should take home, and some musings in the Epilogue (Chapter 18) on this book and what the future holds.
Go to the FUMA website, get an account, and study the online-tutorial.
You will need to use the fwrite function in R to write the concatenated results of the WTCCC1 study (remember: wtccc1_assoc_compl). Make sure you know what the column names are in the file; you'll need those for FUMA. You will also need to compress the resulting output, which will make uploading to FUMA faster. Next, you can download this file and upload it to FUMA.
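A minimal sketch (the output file name is an assumption; fwrite gzips on the fly when compress is set):

```r
library("data.table")

# Write the WTCCC1 results space-delimited and gzipped for upload to FUMA.
data.table::fwrite(wtccc1_assoc_compl,
                   file = "wtccc1/wtccc1_assoc_compl.txt.gz",
                   sep = " ", quote = FALSE, na = "NA",
                   compress = "gzip")
```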
Task: You have gained so much experience in handling files with R by now, you should be able to do this without any problems.
Question: can you figure out the sample size of the WTCCC1 data you used?
Upload via the form on the FUMA-website. Since you’ve done the tutorial you are familiar with its options.
Select everything in the tabs Gene Mapping (positional mapping), Gene Mapping (eQTL mapping) (but not GTEx v6 and GTEx v7), and Gene Mapping (3D Chromatin Interaction mapping), and leave the settings at Gene types and MHC region as-is. At MAGMA analysis, set the MAGMA gene expression analysis to include all tissues except GTEx v6 and GTEx v7.
Don’t forget to give your analysis a name.
This will take some time and so it’s a good moment to carry on with the rest of the practical or take a break, or study for the exam.
Questions
- How many lead SNPs did we find?
- What do the results of MAGMA (the gene-based test) look like and how many genes pass the threshold of multiple testing correction?
- How many loci were mapped?
- How many genes were physically located and how many were mapped to these loci?
- Do you think all loci are ‘correct’, i.e. do you ‘believe’ all the signals looking at the mapping results? Why?
- For what tissues are the signals enriched?
- Are there any chromatin interactions discovered?
Now that you mapped SNPs to genes, it’s time to go back to ‘My Jobs’. Select your job and perform GENE2FUNC.
Questions
- What genes show the lowest expression across tissues?
- And what genes the highest?
- For what pathways are the signals enriched?
- What molecular functions are mapped to the signals?
FUMA is a great tool, but it comes with a caveat. It includes a couple of references for which it is not readily clear which variants are included - the authors do provide the code used on Git, but still, you don't know precisely which variants are filtered. That is key: perhaps the top variant you discovered is filtered out of the reference. This means FUMA will not use it to map SNPs to genes, but rather the next best variant. That variant should be in high LD - but, again, assumptions… And of course, the references used might not match your data well enough.
So, my advice: use FUMA (why not be lazy rather than work hard?), but be aware of the caveats I described. All in all, I do think FUMA is very complete and intuitive, and it makes your work publication-ready because it creates just the right file formats for you too (.png, .svg, .pdf, .jpeg).
That said, time to move on to inspect other phenotypes in relation to your findings in the next Chapter 15 or to return to the previous chapter on post-GWAS analyses @ref(post_gwas).
Given the large biobanks available nowadays, which have genotyped their participants and collected a vast array of information on them, it is possible to perform a phenome-wide association study.
There are several options (Table 15.1).
| Site | Description |
|---|---|
| https://atlas.ctglab.nl | 4,155 GWAS based on UK Biobank release 2 |
| https://biobankengine.stanford.edu | UK Biobank based |
| http://pheweb.sph.umich.edu | University of Michigan EHR-based PheWAS |
| http://pheweb.sph.umich.edu/SAIGE-UKB/ | 1,403 GWAS UK Biobank based |
| http://phewas.mrbase.org | UK Biobank based |
Assignment
You now know with what other traits, risk factors, and diseases our trait of interest (coronary artery disease) is correlated. An obvious next question would be: is this a causal correlation? We will cover this aspect in the next Chapter @ref(mendelian_randomization); alternatively, return to the previous chapter on post-GWAS analyses @ref(post_gwas).
You took the easy way out. No shame in that. Your choice. Right, so as I wrote, MRBase is very easy to handle. It is based on the TwoSampleMR package and uses the MRInstruments package to load in all the genetic instruments. The creators of MRBase have curated hundreds of GWAS and molecular QTL studies, and prepared the data for easy use on the website and in the R packages.
Using MRBase we intend to find out whether Type 2 Diabetes (T2D) causes coronary artery disease and ischemic stroke.
You are almost at the end. Time to return to the previous chapter on post-GWAS analyses @ref(post_gwas).
So you took the hard way. You want to get your hands dirty yourself. Your choice.
As I wrote, we will use the TwoSampleMR package to assess the causal effect of type 2 diabetes on coronary artery disease and ischemic stroke.
Briefly, the workflow of a Two-Sample MR is as follows (see also diagram below):
TwoSampleMR is free to use. It used to require server authentication with a Gmail account, but it no longer does. Should you run into an issue, please refer to this page.
First we load the package.
library("TwoSampleMR")
library("MRInstruments")
library("ieugwasr")
We are ready to load in the available outcomes.
ao_twosamplemr <- available_outcomes()
The MRInstruments-package, which comes with TwoSampleMR, includes several types of instruments:
Let’s select T2D associated instruments.
data("gwas_catalog")
t2d_gwas <- subset(gwas_catalog, grepl("Mahajan", Author) & PubmedID == "24509480" & Phenotype == "Type 2 diabetes")
t2d_exp_dat <- format_data(t2d_gwas)
As with identifying independent GWAS hits, it is important for TwoSampleMR that the instruments for the exposure are independent. We can use MR Base to perform clumping. The European samples from the 1000 Genomes project are used to estimate LD between SNPs. You can provide a list of SNP IDs; the SNPs will be extracted from the 1000 Genomes data, LD will be calculated between them, and among the SNPs with an LD R-square above the specified threshold only the SNP with the lowest p-value will be retained. To do this, use the following command.
t2d_exp_dat <- clump_data(t2d_exp_dat)
Right, so now we are ready to select the proper outcome, CAD. There are multiple datasets available in MRBase, the database used by TwoSampleMR. Let’s review them all and select one with the largest overlap with our exposure SNPs.
ao_twosamplemr[grepl("heart disease", ao_twosamplemr$trait), ]
## Rows: 31
## Columns: 25
## $ id <chr> "ukb-e-I25_CSA", "ukb-e-I25_AFR", "ukb-d-I9_IHD", "ukb-d-…
## $ trait <chr> "I25 Chronic ischaemic heart disease", "I25 Chronic ischa…
## $ group_name <chr> "public", "public", "public", "public", "public", "public…
## $ year <int> 2020, 2020, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 201…
## $ author <chr> "Pan-UKB team", "Pan-UKB team", "Neale lab", "Neale lab",…
## $ sex <chr> "Males and Females", "Males and Females", "Males and Fema…
## $ population <chr> "South Asian", "African American or Afro-Caribbean", "Eur…
## $ unit <chr> "NA", "NA", "NA", "NA", "NA", "SD", "SD", "SD", "SD", "SD…
## $ sample_size <int> 8876, 6636, 361194, 361194, 361194, 463010, 463010, 46301…
## $ build <chr> "HG19/GRCh37", "HG19/GRCh37", "HG19/GRCh37", "HG19/GRCh37…
## $ subcategory <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA…
## $ category <chr> "Binary", "Binary", "Binary", "Binary", "Binary", "Binary…
## $ ontology <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA…
## $ doi <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ note <chr> "NA", "NA", "NA", "NA", "NA", "41204#I259: Output from GW…
## $ ncase <int> 1205, 302, 20857, 10157, 10157, 5861, 5771, 9330, 1195, 5…
## $ ncontrol <int> 7671, 6334, 340337, 351037, 351037, 457149, 457239, 45368…
## $ consortium <chr> "NA", "NA", "NA", "NA", "NA", "MRC-IEU", "MRC-IEU", "MRC-…
## $ mr <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pmid <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 26343387,…
## $ nsnp <int> 9811287, 15478580, 13586589, 13295130, 13295130, 9851867,…
## $ coverage <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ study_design <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ priority <int> 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 3, 0, 0, 0, …
## $ sd <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
It turns out there are several options listed below.
| Code | Year | Author | Description | Ncases | Ncontrols |
|---|---|---|---|---|---|
| ukb-d-I9_IHD | 2018 | Neale Lab | Ischaemic heart disease, wide definition | 20857 | 340337 |
| ukb-d-I9_CHD_NOREV | 2018 | Neale Lab | Major coronary heart disease event excluding revascularizations | 10157 | 351037 |
| ieu-a-7 | 2015 | Nikpay et al. | Coronary artery disease | 60801 | 123504 |
| ieu-a-9 | 2013 | Deloukas et al. | Coronary artery disease | 63746 | 130681 |
| ieu-a-8 | 2011 | Schunkert et al. | Coronary artery disease | 22233 | 64762 |
We will use the ieu-a-7 dataset, as this provides the latest and the greatest; at the same time (when you go into the literature and dig up the publication), be aware that you need a good and sharp definition of the outcome you are interested in.
chd_out_dat <- extract_outcome_data(
snps = t2d_exp_dat$SNP,
# outcomes = "UKB-a:534" # - this one is deprecated, use the one below
# Deprecated IDs being used? Detected numeric IDs. Trying to fix, but please note the changes below for future.
# UKB-a:534 -> ukb-a-534
outcomes = "ieu-a-7"
)
It is important to make sure that, for any given SNP, the effect allele associated with T2D is the same as the effect allele associated with CAD. In other words, we need to harmonise the data.
dat <- harmonise_data(
exposure_dat = t2d_exp_dat,
outcome_dat = chd_out_dat
)
Now that the data are harmonised, we are ready to perform an MR and test for a causal relation of T2D with CAD.
res <- mr(dat)
DT::datatable(res, caption = "MR of type 2 diabetes and coronary artery disease", rownames = FALSE)
Question: What is the causal effect of T2D on CAD? Can you describe this using the knowledge from the lecture and the results in the table?
Several tests are applied. You can change the method(s) applied using the method_list argument. Below is a list of the available methods.
knitr::kable(mr_method_list(), caption = "Available MR methods", row.names = FALSE)
| obj | name | PubmedID | Description | use_by_default | heterogeneity_test |
|---|---|---|---|---|---|
| mr_wald_ratio | Wald ratio | | | TRUE | FALSE |
| mr_two_sample_ml | Maximum likelihood | | | FALSE | TRUE |
| mr_egger_regression | MR Egger | 26050253 | | TRUE | TRUE |
| mr_egger_regression_bootstrap | MR Egger (bootstrap) | 26050253 | | FALSE | FALSE |
| mr_simple_median | Simple median | | | FALSE | FALSE |
| mr_weighted_median | Weighted median | | | TRUE | FALSE |
| mr_penalised_weighted_median | Penalised weighted median | | | FALSE | FALSE |
| mr_ivw | Inverse variance weighted | | | TRUE | TRUE |
| mr_ivw_radial | IVW radial | | | FALSE | TRUE |
| mr_ivw_mre | Inverse variance weighted (multiplicative random effects) | | | FALSE | FALSE |
| mr_ivw_fe | Inverse variance weighted (fixed effects) | | | FALSE | FALSE |
| mr_simple_mode | Simple mode | | | TRUE | FALSE |
| mr_weighted_mode | Weighted mode | | | TRUE | FALSE |
| mr_weighted_mode_nome | Weighted mode (NOME) | | | FALSE | FALSE |
| mr_simple_mode_nome | Simple mode (NOME) | | | FALSE | FALSE |
| mr_raps | Robust adjusted profile score (RAPS) | | | FALSE | FALSE |
| mr_sign | Sign concordance test | | Tests for concordance of signs between exposure and outcome | FALSE | FALSE |
| mr_uwr | Unweighted regression | | Doesn’t use any weights | FALSE | TRUE |
Many genetic variants are associated with multiple traits; pleiotropy is thus common and widespread in the human genome. This poses an issue for teasing apart causal relations in Mendelian randomisation. Vertical pleiotropy arises because a SNP influences one trait, which in turn influences another; we can use MR to estimate the causal influence between such traits. Horizontal pleiotropy arises due to SNPs influencing two traits through independent pathways (Hemani G. 2018).
Further, we should test for study heterogeneity.
mr_heterogeneity(dat)
## id.exposure id.outcome outcome exposure
## 1 D3ofMc ieu-a-7 Coronary heart disease || id:ieu-a-7 Type 2 diabetes
## 2 D3ofMc ieu-a-7 Coronary heart disease || id:ieu-a-7 Type 2 diabetes
## method Q Q_df Q_pval
## 1 MR Egger 82.1911 48 0.001545015
## 2 Inverse variance weighted 84.2921 49 0.001282616
mr_pleiotropy_test(dat)
## id.exposure id.outcome outcome exposure
## 1 D3ofMc ieu-a-7 Coronary heart disease || id:ieu-a-7 Type 2 diabetes
## egger_intercept se pval
## 1 0.004478786 0.004043328 0.2735122
Question: What do the results of the heterogeneity and pleiotropy tests tell you?
It is also key to inspect the causal effects per SNP and perform a leave-one-out analysis.
res_single <- mr_singlesnp(dat)
DT::datatable(res_single, caption = "Single SNP MR analysis", rownames = FALSE)
res_loo <- mr_leaveoneout(dat)
DT::datatable(res_loo, caption = "Leave-one-out MR analysis", rownames = FALSE)
Question: What do the results of the single SNP and leave-one-out analyses tell you?
It is important to present the causal inference graphically.
Scatter plot
res <- mr(dat)
p1 <- mr_scatter_plot(res, dat)
p1
## $`D3ofMc.ieu-a-7`
##
## attr(,"split_type")
## [1] "data.frame"
## attr(,"split_labels")
## id.exposure id.outcome
## 1 D3ofMc ieu-a-7
Question: What do you see in the scatter plot? And what can you conclude from this?
Single SNP plot
res_single <- mr_singlesnp(dat)
p2 <- mr_forest_plot(res_single)
p2
## $`D3ofMc.ieu-a-7`
##
## attr(,"split_type")
## [1] "data.frame"
## attr(,"split_labels")
## id.exposure id.outcome
## 1 D3ofMc ieu-a-7
Question: What do you see in the forest plot? And what can you conclude from this?
Leave-One-Out plot
res_loo <- mr_leaveoneout(dat)
p3 <- mr_leaveoneout_plot(res_loo)
p3
## $`D3ofMc.ieu-a-7`
##
## attr(,"split_type")
## [1] "data.frame"
## attr(,"split_labels")
## id.exposure id.outcome
## 1 D3ofMc ieu-a-7
Question: What does a leave-one-out plot tell you? And what can you conclude from this?
Funnel plot
res_single <- mr_singlesnp(dat)
p4 <- mr_funnel_plot(res_single)
p4
## $`D3ofMc.ieu-a-7`
##
## attr(,"split_type")
## [1] "data.frame"
## attr(,"split_labels")
## id.exposure id.outcome
## 1 D3ofMc ieu-a-7
Question: Why would you use a funnel plot? And what can it tell you?
Question: What are your main conclusions from the MR analysis?
You are almost at the end. Time to return to the previous chapter on post-GWAS analyses @ref(post_gwas).
What started as a simple ‘let’s write a practical on how to do a GWAS’, escalated into this GitBook. My second book. My fifth child (as far as I know). I hope you found it to be useful and learned a bit.
That said, much can be improved and so don’t hesitate to contact me.
As with any proper epilogue I should not forget a heartfelt and honest thank you to the readers and users of this work. Gratitude also goes to my dear colleagues Charlotte Onland, Jessica van Setten, and Kristel van Eijk, and former colleague Sara Pulit who asked me back in 2017 to join the course as a lecturer. It has been a pleasure to work with and learn from you, and it has been a fun (and sometimes stressful) experience to teach this course.
| Name | Email |
|---|---|
| Sander W. van der Laan | s.w.vanderlaan-2@umcutrecht.nl, vanderlaanand.science |
| Charlotte Onland | N.C.Onland@umcutrecht.nl |
| Jessica van Setten | j.vansetten@umcutrecht.nl |
| Kristel van Eijk | K.vanEijk-2@umcutrecht.nl |
| | M.Mokry@umcutrecht.nl |
| | E.J.A.Smulders-2@umcutrecht.nl |
First we need to find the independent hits by clumping the results. We will just use the defaults, but please take note of all the options here: https://www.cog-genomics.org/plink/1.9/postproc#clump
plink --bfile ~/data/shared/gwas/gwa --clump dummy_project/data.assoc.logistic --clump-p1 5e-8 --clump-p2 0.05 --clump-kb 500 --clump-r2 0.05 --clump-verbose --out dummy_project/data.assoc.logistic
Now you will have a list of all the independent SNPs, i.e. the genetic loci, that are associated with the trait.
cat dummy_project/data.assoc.logistic.clumped
##
## CHR F SNP BP P TOTAL NSIG S05 S01 S001 S0001
## 3 1 rs6802898 12366207 4.18e-20 50 35 4 2 1 8
##
## KB RSQ ALLELES F P
## (INDEX) rs6802898 0 1.000 T 1 4.18e-20
##
## rs305500 -400 0.0588 TC/CA 1 0.0476
## rs420014 -394 0.0552 TA/CG 1 0.015
## rs305494 -392 0.0681 TG/CA 1 0.025
## rs438129 -383 0.0721 TT/CC 1 0.0126
## rs7615580 -364 0.309 TC/CT 1 2.05e-08
## rs307560 -298 0.153 TT/CC 1 1.68e-06
## rs11720130 -235 0.0838 TA/CG 1 0.00218
## rs7616006 -124 0.0831 TA/CG 1 5.25e-05
## rs6775191 -119 0.0506 TG/CA 1 0.000182
## rs167466 -110 0.0523 TT/CC 1 0.00222
## rs12635120 -86.6 0.332 TG/CA 1 2.77e-07
## rs6798713 -85.6 0.288 TC/CT 1 2.01e-06
## rs2920500 -67.8 0.102 TA/CG 1 6.59e-05
## rs6768587 -53.1 0.295 TG/CA 1 3.36e-08
## rs2028760 -18.3 0.305 TA/CG 1 3.67e-08
##
## RANGE: chr3:11966007..12366207
## SPAN: 400kb
##
## ------------------------------------------------------------------
##
##
## CHR F SNP BP P TOTAL NSIG S05 S01 S001 S0001
## 10 1 rs7901695 114744078 6.78e-12 32 24 2 1 1 4
##
## KB RSQ ALLELES F P
## (INDEX) rs7901695 0 1.000 C 1 6.78e-12
##
## rs7917983 -21.2 0.0589 CC/TT 1 0.046
## rs7895307 -10.1 0.0819 CG/TA 1 0.00592
## rs7903146 4.26 0.784 CT/TC 1 3.25e-08
## rs7904519 19.8 0.582 CG/TA 1 2.89e-08
## rs11196192 28.2 0.162 CG/TT 1 0.0268
## rs10885409 54 0.502 CC/TT 1 8.35e-06
## rs12255372 54.8 0.624 CT/TG 1 1.55e-06
## rs4918789 67.7 0.24 CG/TT 1 0.000248
##
## RANGE: chr10:114722872..114811797
## SPAN: 88kb
##
## ------------------------------------------------------------------
##
##
## CHR F SNP BP P TOTAL NSIG S05 S01 S001 S0001
## 16 1 rs8050136 52373776 1.52e-08 23 16 1 3 2 1
##
## KB RSQ ALLELES F P
## (INDEX) rs8050136 0 1.000 A 1 1.52e-08
##
## rs7205986 -61.1 0.226 AG/CA 1 0.0434
## rs6499640 -46.6 0.258 AA/CG 1 0.00367
## rs1861868 -25.9 0.15 AT/CC 1 0.0063
## rs1075440 -25.4 0.162 AA/CG 1 0.00802
## rs3751812 2.18 0.994 AT/CG 1 1.63e-08
## rs7190492 12.5 0.258 AG/CA 1 0.0007
## rs8044769 22.9 0.524 AC/CT 1 0.000611
##
## RANGE: chr16:52312647..52396636
## SPAN: 83kb
##
## ------------------------------------------------------------------
Question: How many independent hits did you find?
In summary, clumping identifies three loci and now that you know them you can create regional association plots.
Now you are ready to visualize regions of interest using a tool like LocusZoom. LocusZoom is a great and easy tool, as you'll soon discover. It's the fast and lazy way to get regional association plots.
First, let’s get what we need (SNP and P) and gzip the results.
echo "SNP P" > dummy_project/data.assoc.logistic.locuszoom
cat dummy_project/data.assoc.logistic | awk '$5=="ADD"' | awk '{ print $2, $9 }' >> dummy_project/data.assoc.logistic.locuszoom
gzip -v dummy_project/data.assoc.logistic.locuszoom
You can now upload this data.assoc.logistic.locuszoom.gz. Try to visualize each locus using the information above and by following the instructions. Choose HapMap 2, hg18, CEU as the LD reference.
You should get something like below.
An alternative way to create regional association plots for each of these loci is RACER. This package offers substantial flexibility and, quite frankly, is easier to edit code-wise - think color scheme, background, etc. - as it is based on ggplot2. The creator, Olivia Sabik, also wrote a nice vignette with excellent instructions.
Using RACER is also a step-up to colocalization about which you read a bit more in chapter @ref(post_gwas).
Right, so from the above we identified three independent hits: rs6802898, rs7901695, rs8050136. Let’s put them in a dataset with their chromosome and basepair position and create a list.
## rsID CHR BP
## 1 rs6802898 3 12366207
## 2 rs7901695 10 114744078
## 3 rs8050136 16 52373776
## [1] "rs6802898" "rs7901695" "rs8050136"
We can now take the filtered and prepared GWAS summary statistics, i.e. gwas_assoc_compl you created earlier in chapter @ref(gwas_visuals), to draw three nice regional association plots around these variants.
This file should contain the following:
tibble [306,102 × 14] (S3: tbl_df/tbl/data.frame)
$ SNP : chr [1:306102] "rs10000010" "rs10000023" "rs10000030" "rs1000007" ...
$ CHR : int [1:306102] 4 4 4 2 4 4 16 4 2 4 ...
$ BP : int [1:306102] 21227772 95952929 103593179 237416793 21504615 157793485 24325037 33810744 235355721 77575270 ...
$ A1 : chr [1:306102] "C" "T" "A" "C" ...
$ A2 : chr [1:306102] "T" "G" "G" "T" ...
$ MAF : num [1:306102] 0.426 0.484 0.162 0.312 0.343 ...
$ callrate: num [1:306102] 0.999 0.989 0.998 1 0.991 ...
$ NMISS : int [1:306102] 3996 3957 3991 4000 3963 3919 3899 3931 3927 3941 ...
$ NCHROBS : int [1:306102] 7992 7914 7982 8000 7926 7838 7798 7862 7854 7882 ...
$ BETA : num [1:306102] 0.04114 -0.00985 -0.02235 0.01784 -0.07904 ...
$ SE : num [1:306102] 0.0457 0.0456 0.0605 0.0489 0.0471 ...
$ OR : num [1:306102] 1.042 0.99 0.978 1.018 0.924 ...
$ STAT : num [1:306102] 0.901 -0.216 -0.37 0.365 -1.677 ...
$ P : num [1:306102] 0.3676 0.829 0.7117 0.7152 0.0935 ...
- attr(*, ".internal.selfref")=<externalptr>
- attr(*, "sorted")= chr "SNP"
We could limit the region to plot to the clump size (see above), but generally it's fine to just take the top variant ± 500kb.
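The plotting chunk itself is not rendered here; a minimal sketch for the first locus, assuming the formatRACER()/ldRACER()/singlePlotRACER() interface from the RACER vignette (note that ldRACER queries an online 1000G LD resource):

```r
library("RACER")

# Point RACER at the chromosome (2), position (3), and p-value (14)
# columns of gwas_assoc_compl (see the str() output above).
assoc_fmt <- RACER::formatRACER(assoc_data = as.data.frame(gwas_assoc_compl),
                                chr_col = 2, pos_col = 3, p_col = 14)

# Annotate LD with the lead variant using 1000G European samples.
assoc_ld <- RACER::ldRACER(assoc_data = assoc_fmt,
                           rs_col = 1, pops = "EUR",
                           lead_snp = "rs6802898")

# Plot the top variant +/- 500kb on chromosome 3.
RACER::singlePlotRACER(assoc_data = assoc_ld,
                       chr = 3, build = "hg19", plotby = "coord",
                       start_plot = 12366207 - 500000,
                       end_plot = 12366207 + 500000)
```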
This should produce figures similar to those below.
Figure 19.1: Regional association plots.
Please note that we are letting RACER plot the variants on genome build 37 (hg19), while we actually provide hg18-based chromosome and basepair positions. The proper thing to do would be to liftOver our coordinates to match hg19; here we didn't, because I just wanted to show you how it's done.
There used to be a time when the preferred software for Principal Component Analysis (PCA) was EIGENSOFT. For many, it still is. However, EIGENSOFT is a bit challenging to get working, to say the least. You need to install some programs, and this is not always straightforward.
So, here’s the deal.
I will share the how-to for a macOS environment below in EIGENSOFT - this should work in a Linux environment too as macOS is UNIX-based. You can choose to try and make it work on your system (be it UNIX or macOS based) at home (or in the office).
However, I recommend that you use the --pca function, which is present in PLINK v1.9 and up (I describe this in the chapter 'Sample QC').
You need to install brew, the missing package-manager and accompanying packages that Apple didn’t provide.
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Next, check that everything is in order.
brew doctor
Right, now that you’ve done that, you’re ready to install gsl and openblas.
brew install gsl
brew install openblas
You may also require llvm for EIGENSOFT to work.
brew install llvm
I am still sharing the code you’ll need - you could try this on your system.
mkdir -v $HOME/git
cd $HOME/git
git clone https://github.com/DReichLab/EIG.git
cd EIG/src
make
make install
Throughout the book you encountered questions related to some steps and output that you generated. Here you can review the answers for each of these questions - of course, for some different answers are possible.
Note: Answers to all the questions will come at a later stage during the course.
Question: Can you think of what the '11,440 heterozygous haploid genotypes present' represent?
Question: Can you think of other scenarios in which you may want to extend the check on differential missingness beyond a check between cases and controls?
- For instance 'genotyping platform' or 'hospital of inclusion', if you think this might influence the genotyping experiment technically.
Question: Load the data using R. [Hint: use and adapt the examples from the previous chapters.]
library("data.table")
gwas_HWE <- data.table::fread("dummy_project/gwa.hwe")
gwas_FRQ <- data.table::fread("dummy_project/gwa.frq")
gwas_IMISS <- data.table::fread("dummy_project/gwa.imiss")
gwas_LMISS <- data.table::fread("dummy_project/gwa.lmiss")
# you'll need to calculate the -log10(P) for the HWE p-values
gwas_HWE$logP <- -log10(gwas_HWE$P)
# you'll need to calculate the sample call rate
gwas_IMISS$callrate <- 1 - gwas_IMISS$F_MISS
# you'll need to calculate the SNP call rate
gwas_LMISS$callrate <- 1 - gwas_LMISS$F_MISS
Question: Plot the per-stratum HWE p-values using R. [Hint: use and adapt the examples from the previous chapters.]
library("ggpubr")
ggpubr::gghistogram(gwas_HWE, x = "logP",
add = "mean",
add.params = list(color = "#595A5C", linetype = "dashed", size = 1),
rug = TRUE,
# color = "#1290D9", fill = "#1290D9",
color = "TEST", fill = "TEST",
palette = "lancet",
facet.by = "TEST",
bins = 50,
xlab = "HWE -log10(P)") +
ggplot2::geom_vline(xintercept = 5, linetype = "dashed",
color = "#E55738", size = 1)
ggplot2::ggsave("dummy_project/gwas-hwe.png",
plot = last_plot())
Question: Plot the allele frequencies using R. [Hint: use and adapt the examples from the previous chapters.]
ggpubr::gghistogram(gwas_FRQ, x = "MAF",
add = "mean", add.params = list(color = "#595A5C", linetype = "dashed", size = 1),
rug = TRUE,
color = "#1290D9", fill = "#1290D9",
xlab = "minor allele frequency") +
ggplot2::geom_vline(xintercept = 0.05, linetype = "dashed",
color = "#E55738", size = 1)
ggplot2::ggsave("dummy_project/gwas-freq.png",
plot = last_plot())
Question: Plot the per-sample call rates using R. [Hint: use and adapt the examples from the previous chapters.]
ggpubr::gghistogram(gwas_IMISS, x = "callrate",
add = "mean", add.params = list(color = "#595A5C", linetype = "dashed", size = 1),
rug = TRUE, bins = 50,
color = "#1290D9", fill = "#1290D9",
xlab = "per sample call rate") +
ggplot2::geom_vline(xintercept = 0.95, linetype = "dashed",
color = "#E55738", size = 1)
ggplot2::ggsave("data/dummy_project/gwas-sample-call-rate.png",
plot = last_plot())
Question: Plot the per-SNP call rates using R. [Hint: use and adapt the examples from the previous chapters.]
ggpubr::gghistogram(gwas_LMISS, x = "callrate",
add = "mean", add.params = list(color = "#595A5C", linetype = "dashed", size = 1),
rug = TRUE, bins = 50,
color = "#1290D9", fill = "#1290D9",
xlab = "per SNP call rate") +
ggplot2::geom_vline(xintercept = 0.95, linetype = "dashed",
color = "#E55738", size = 1)
ggplot2::ggsave("data/dummy_project/gwas-snp-call-rate.png",
plot = last_plot())
Question: Do you know which model, i.e. `TEST`, is most commonly used and reported? And why is that, do you think?
Question: How come there are more lines in this file than there are variants?
Question: Can you write down the mathematical relation between beta and OR?
Question: Try to figure out how to get the number of variants per chromosome. Why does the number of variants per chromosome (approximately) correlate with the chromosome number?
Question: Where are the data for chromosome X, Y and MT?
Question: What do the grey spots on the density plot indicate?
Question: What do you notice in the ‘per sample call rate’ graph? Can you think of a reason why this is? And how would you deal with this?
Question: Do you have any thoughts on that? Do you agree with the filters I set below? How would you do it differently and why?
Question: can you figure out the sample size of the WTCCC1 data you used?
Questions
- How many lead SNPs did we find?
- What do the results of MAGMA (the gene-based test) look like and how many genes pass the threshold of multiple testing correction?
- How many loci were mapped?
- How many genes were physically located and how many were mapped to these loci?
- Do you think all loci are ‘correct’, i.e. do you ‘believe’ all the signals looking at the mapping results? Why?
- For what tissues are the signals enriched?
- Are there any chromatin interactions discovered?
Questions
- What genes show the lowest expression across tissues?
- And what genes the highest?
- For what pathways are the signals enriched?
- What molecular functions are mapped to the signals?
Question: What is the causal effect of T2D on CAD? Can you describe this using the knowledge from the lecture and the results in the table?
Question: What do the results of the heterogeneity and pleiotropy tests tell you?
Question: What do the results of the single SNP and leave-one-out analyses tell you?
Question: What do you see in the scatter plot? And what can you conclude from this?
Question: What do you see in the forest plot? And what can you conclude from this?
Question: What does a leave-one-out plot tell you? And what can you conclude from this?
Question: Why would you use a funnel plot? And what can it tell you?
Question: What are your main conclusions from the MR analysis?
Question: How many independent hits did you find?
I took the utmost care to refer to the original works and data sources where needed. Likewise, all the images and figures are either produced specifically for this book, or taken from Unsplash to brighten up the book. If you feel I made a mistake and your work should be properly referenced, please don't hesitate to contact me.
These are the images from Unsplash listed here in no particular order.
The content contained herein is provided only for educational and informational purposes or as required by Dutch law. The author attempted to ensure that content is accurate and obtained from reliable sources, but does not represent it to be error-free. The author may add, amend or repeal any text, procedure or regulation, and failure to timely post such changes to this book shall not be construed as a waiver of enforcement. The author does not warrant that any functions on this website or the contents and references herein will be uninterrupted, that defects will be corrected, or that this website or the contents and references will be free from viruses or other harmful components. Any links to third party information on the author’s website are provided as a courtesy and do not constitute an endorsement of those materials or the third party providing them.
This book and all its material (“content”) is protected by copyright under Dutch Copyright laws and is the property of the author or the party credited as the provider of the content. You may not copy, reproduce, distribute, publish, display, perform, modify, create derivative works, transmit, or in any way exploit any such content, nor may you distribute any part of this content over any network, including a local area network, sell or offer it for sale, or use such content to construct any kind of database. You may not alter or remove any copyright or other notice from copies of the content on this website. Copying or storing any content except as provided above is expressly prohibited without prior written permission of the author or the copyright holder identified in the individual content’s copyright notice. For permission to use this content, please contact the author.
Copyright 1979-2024. All rights reserved. Sander W. van der Laan | s.w.vanderlaan [at] gmail.com | https://vanderlaanand.science. Published with bookdown.
The 2022 and 2024 editions of this book were produced in RStudio with the bookdown package. Below is a listing of the installed programs and libraries, the operating system, and their specific versions.
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.3.3 (2024-02-29)
## os macOS Sonoma 14.5
## system x86_64, darwin20
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz America/New_York
## date 2024-04-08
## pandoc 3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## abind 1.4-5 2016-07-21 [2] CRAN (R 4.3.0)
## annotate * 1.80.0 2023-10-24 [2] Bioconductor
## AnnotationDbi * 1.64.1 2023-11-03 [2] Bioconductor
## askpass 1.2.0 2023-09-03 [2] CRAN (R 4.3.0)
## backports 1.4.1 2021-12-13 [2] CRAN (R 4.3.0)
## Biobase * 2.62.0 2023-10-24 [2] Bioconductor
## BiocGenerics * 0.48.1 2023-11-01 [2] Bioconductor
## Biostrings 2.70.3 2024-03-13 [2] Bioconductor 3.18 (R 4.3.3)
## bit 4.0.5 2022-11-15 [2] CRAN (R 4.3.0)
## bit64 4.0.5 2020-08-30 [2] CRAN (R 4.3.0)
## bitops 1.0-7 2021-04-24 [2] CRAN (R 4.3.0)
## blob 1.2.4 2023-03-17 [2] CRAN (R 4.3.0)
## bookdown * 0.38.1 2024-03-26 [2] Github (rstudio/bookdown@50a1c1e)
## broom 1.0.5 2023-06-09 [2] CRAN (R 4.3.0)
## bslib 0.6.2 2024-03-22 [2] CRAN (R 4.3.2)
## cachem 1.0.8 2023-05-01 [2] CRAN (R 4.3.0)
## calibrate 1.7.7 2020-06-19 [2] CRAN (R 4.3.0)
## car 3.1-2 2023-03-30 [2] CRAN (R 4.3.0)
## carData 3.0-5 2022-01-06 [2] CRAN (R 4.3.0)
## chromote 0.2.0 2024-02-12 [1] CRAN (R 4.3.2)
## cli 3.6.2 2023-12-11 [2] CRAN (R 4.3.0)
## CMplot * 4.5.1 2024-01-19 [2] CRAN (R 4.3.0)
## colorspace 2.1-0 2023-01-23 [2] CRAN (R 4.3.0)
## crayon 1.5.2 2022-09-29 [2] CRAN (R 4.3.0)
## crul 1.4.0 2023-05-17 [2] CRAN (R 4.3.0)
## curl 5.2.1 2024-03-01 [2] CRAN (R 4.3.2)
## data.table * 1.15.4 2024-03-30 [1] CRAN (R 4.3.2)
## DBI 1.2.2 2024-02-16 [2] CRAN (R 4.3.2)
## devtools * 2.4.5 2022-10-11 [2] CRAN (R 4.3.0)
## digest 0.6.35 2024-03-11 [2] CRAN (R 4.3.2)
## dplyr * 1.1.4 2023-11-17 [2] CRAN (R 4.3.0)
## DT * 0.33 2024-04-04 [1] CRAN (R 4.3.2)
## ellipsis 0.3.2 2021-04-29 [2] CRAN (R 4.3.0)
## evaluate 0.23 2023-11-01 [2] CRAN (R 4.3.0)
## fansi 1.0.6 2023-12-08 [2] CRAN (R 4.3.0)
## fastmap 1.1.1 2023-02-24 [2] CRAN (R 4.3.0)
## flextable * 0.9.5 2024-03-06 [1] CRAN (R 4.3.2)
## fontBitstreamVera 0.1.1 2017-02-01 [2] CRAN (R 4.3.0)
## fontLiberation 0.1.0 2016-10-15 [2] CRAN (R 4.3.0)
## fontquiver 0.2.1 2017-02-01 [2] CRAN (R 4.3.0)
## forcats * 1.0.0 2023-01-29 [2] CRAN (R 4.3.0)
## foreign 0.8-86 2023-11-28 [2] CRAN (R 4.3.3)
## formatR * 1.14 2023-01-17 [2] CRAN (R 4.3.0)
## fs 1.6.3 2023-07-20 [2] CRAN (R 4.3.0)
## gdtools 0.3.7 2024-03-05 [2] CRAN (R 4.3.2)
## geneplotter * 1.80.0 2023-10-24 [2] Bioconductor
## generics 0.1.3 2022-07-05 [2] CRAN (R 4.3.0)
## GenomeInfoDb 1.38.8 2024-03-15 [2] Bioconductor 3.18 (R 4.3.3)
## GenomeInfoDbData 1.2.11 2024-03-26 [2] Bioconductor
## gfonts 0.2.0 2023-01-08 [2] CRAN (R 4.3.0)
## ggplot2 * 3.5.0 2024-02-23 [2] CRAN (R 4.3.2)
## ggpubr * 0.6.0.999 2024-03-26 [2] Github (kassambara/ggpubr@6aeb4f7)
## ggsci * 3.0.3 2024-03-25 [2] CRAN (R 4.3.2)
## ggsignif 0.6.4 2022-10-13 [2] CRAN (R 4.3.0)
## ggthemes * 5.1.0 2024-02-10 [2] CRAN (R 4.3.2)
## glue 1.7.0 2024-01-09 [2] CRAN (R 4.3.0)
## gtable 0.3.4 2023-08-21 [2] CRAN (R 4.3.0)
## hms 1.1.3 2023-03-21 [2] CRAN (R 4.3.0)
## htmltools 0.5.8 2024-03-25 [2] CRAN (R 4.3.2)
## htmlwidgets 1.6.4 2023-12-06 [2] CRAN (R 4.3.0)
## httpcode 0.3.0 2020-04-10 [2] CRAN (R 4.3.0)
## httpuv 1.6.15 2024-03-26 [2] CRAN (R 4.3.2)
## httr 1.4.7 2023-08-15 [2] CRAN (R 4.3.0)
## ieugwasr * 0.2.2-9000 2024-04-05 [1] Github (MRCIEU/ieugwasr@52e03a7)
## IRanges * 2.36.0 2023-10-24 [2] Bioconductor
## jquerylib 0.1.4 2021-04-26 [2] CRAN (R 4.3.0)
## jsonlite 1.8.8 2023-12-04 [2] CRAN (R 4.3.0)
## kableExtra * 1.4.0 2024-01-24 [1] CRAN (R 4.3.2)
## KEGGREST 1.42.0 2023-10-24 [2] Bioconductor
## knitr * 1.45 2023-10-30 [1] CRAN (R 4.3.0)
## later 1.3.2 2023-12-06 [2] CRAN (R 4.3.0)
## lattice * 0.22-6 2024-03-20 [2] CRAN (R 4.3.2)
## lazyeval 0.2.2 2019-03-15 [2] CRAN (R 4.3.0)
## lifecycle 1.0.4 2023-11-07 [2] CRAN (R 4.3.0)
## lubridate * 1.9.3 2023-09-27 [2] CRAN (R 4.3.0)
## magrittr 2.0.3 2022-03-30 [2] CRAN (R 4.3.0)
## MASS 7.3-60.0.1 2024-01-13 [2] CRAN (R 4.3.3)
## Matrix 1.6-5 2024-01-11 [2] CRAN (R 4.3.3)
## memoise 2.0.1 2021-11-26 [2] CRAN (R 4.3.0)
## mime 0.12 2021-09-28 [2] CRAN (R 4.3.0)
## miniUI 0.1.1.1 2018-05-18 [2] CRAN (R 4.3.0)
## mitools 2.4 2019-04-26 [1] CRAN (R 4.3.0)
## MRInstruments * 0.3.2 2024-03-27 [1] Github (MRCIEU/MRInstruments@efa2ca0)
## munsell 0.5.1 2024-04-01 [1] CRAN (R 4.3.2)
## officer 0.6.5 2024-02-24 [2] CRAN (R 4.3.2)
## openssl 2.1.1 2023-09-25 [2] CRAN (R 4.3.0)
## openxlsx * 4.2.5.2 2023-02-06 [2] CRAN (R 4.3.0)
## pillar 1.9.0 2023-03-22 [2] CRAN (R 4.3.0)
## pkgbuild 1.4.4 2024-03-17 [2] CRAN (R 4.3.2)
## pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.3.0)
## pkgload 1.3.4 2024-01-16 [2] CRAN (R 4.3.0)
## plotly * 4.10.4 2024-01-13 [2] CRAN (R 4.3.0)
## png 0.1-8 2022-11-29 [2] CRAN (R 4.3.0)
## processx 3.8.4 2024-03-16 [2] CRAN (R 4.3.2)
## profvis 0.3.8 2023-05-02 [2] CRAN (R 4.3.0)
## promises 1.2.1 2023-08-10 [2] CRAN (R 4.3.0)
## ps 1.7.6 2024-01-18 [2] CRAN (R 4.3.0)
## purrr * 1.0.2 2023-08-10 [2] CRAN (R 4.3.0)
## qqman * 0.1.9 2023-08-23 [2] CRAN (R 4.3.0)
## R6 2.5.1 2021-08-19 [2] CRAN (R 4.3.0)
## RACER * 1.0.0 2024-03-26 [2] Github (oliviasabik/RACER@1394c9d)
## ragg 1.3.0 2024-03-13 [2] CRAN (R 4.3.2)
## RColorBrewer 1.1-3 2022-04-03 [2] CRAN (R 4.3.0)
## Rcpp 1.0.12 2024-01-09 [2] CRAN (R 4.3.0)
## RCurl 1.98-1.14 2024-01-09 [2] CRAN (R 4.3.0)
## readr * 2.1.5 2024-01-10 [2] CRAN (R 4.3.0)
## remotes * 2.5.0 2024-03-17 [2] CRAN (R 4.3.2)
## rlang 1.1.3 2024-01-10 [2] CRAN (R 4.3.0)
## rmarkdown * 2.26.1 2024-03-26 [2] Github (rstudio/rmarkdown@ee69d59)
## RNHANES * 1.1.1 2024-03-27 [1] Github (silentspringinstitute/RNHANES@860fc94)
## RSQLite 2.3.5 2024-01-21 [2] CRAN (R 4.3.0)
## rstatix 0.7.2 2023-02-01 [2] CRAN (R 4.3.0)
## rstudioapi 0.16.0 2024-03-24 [2] CRAN (R 4.3.2)
## rvest 1.0.4 2024-02-12 [2] CRAN (R 4.3.2)
## S4Vectors * 0.40.2 2023-11-23 [2] Bioconductor
## sass 0.4.9 2024-03-15 [2] CRAN (R 4.3.2)
## scales 1.3.0 2023-11-28 [2] CRAN (R 4.3.0)
## sessioninfo 1.2.2 2021-12-06 [2] CRAN (R 4.3.0)
## shiny 1.8.1 2024-03-26 [2] CRAN (R 4.3.2)
## stringi 1.8.3 2023-12-11 [2] CRAN (R 4.3.0)
## stringr * 1.5.1 2023-11-14 [2] CRAN (R 4.3.0)
## survey 4.4-2 2024-03-20 [1] CRAN (R 4.3.2)
## survival 3.5-8 2024-02-14 [2] CRAN (R 4.3.3)
## svglite 2.1.3 2023-12-08 [1] CRAN (R 4.3.0)
## systemfonts 1.0.6 2024-03-07 [2] CRAN (R 4.3.2)
## textshaping 0.3.7 2023-10-09 [2] CRAN (R 4.3.0)
## tibble * 3.2.1 2023-03-20 [2] CRAN (R 4.3.0)
## tidyr * 1.3.1 2024-01-24 [2] CRAN (R 4.3.2)
## tidyselect 1.2.1 2024-03-11 [2] CRAN (R 4.3.2)
## tidyverse * 2.0.0 2023-02-22 [2] CRAN (R 4.3.0)
## timechange 0.3.0 2024-01-18 [2] CRAN (R 4.3.0)
## tinytex * 0.50 2024-03-16 [2] CRAN (R 4.3.2)
## TwoSampleMR * 0.5.11 2024-03-26 [2] Github (MRCIEU/TwoSampleMR@ebfeb95)
## tzdb 0.4.0 2023-05-12 [2] CRAN (R 4.3.0)
## urlchecker 1.0.1 2021-11-30 [2] CRAN (R 4.3.0)
## usethis * 2.2.3 2024-02-19 [2] CRAN (R 4.3.2)
## utf8 1.2.4 2023-10-22 [2] CRAN (R 4.3.0)
## uuid 1.2-0 2024-01-14 [2] CRAN (R 4.3.0)
## vctrs 0.6.5 2023-12-01 [2] CRAN (R 4.3.0)
## viridisLite 0.4.2 2023-05-02 [2] CRAN (R 4.3.0)
## webshot * 0.5.5 2023-06-26 [1] CRAN (R 4.3.0)
## webshot2 * 0.1.1 2023-08-11 [1] CRAN (R 4.3.0)
## websocket 1.4.1 2021-08-18 [1] CRAN (R 4.3.0)
## withr 3.0.0 2024-01-16 [2] CRAN (R 4.3.0)
## xfun 0.43 2024-03-25 [2] CRAN (R 4.3.2)
## XML * 3.99-0.16.1 2024-01-22 [2] CRAN (R 4.3.2)
## xml2 * 1.3.6 2023-12-04 [2] CRAN (R 4.3.0)
## xtable 1.8-4 2019-04-21 [2] CRAN (R 4.3.0)
## XVector 0.42.0 2023-10-24 [2] Bioconductor
## yaml 2.3.8 2023-12-11 [2] CRAN (R 4.3.0)
## zip 2.3.1 2024-01-27 [2] CRAN (R 4.3.2)
## zlibbioc 1.48.2 2024-03-13 [2] Bioconductor 3.18 (R 4.3.3)
##
## [1] /Users/slaan3/Library/R/x86_64/4.3/library
## [2] /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library
##
## ──────────────────────────────────────────────────────────────────────────────